Abstract



In the budgeted learning problem, we are allowed to experiment on a set of alternatives (given a fixed 
experimentation budget) with the goal of picking a single alternative with the largest possible expected 
payoff. Approximation algorithms for this problem were developed by Guha and Munagala by rounding 
a linear program that couples the various alternatives together. In this paper we present an index for 
this problem, which we call the ratio index, which also guarantees a constant factor approximation. 
Index-based policies have the advantage that a single number (i.e. the index) can be computed for each 
alternative irrespective of all other alternatives, and the alternative with the highest index is experimented 
upon. This is analogous to the famous Gittins index for the discounted multi-armed bandit problem. 

The ratio index has several interesting structural properties. First, we show that it can be computed 
in strongly polynomial time. Second, we show that with the appropriate discount factor, the Gittins 
index and our ratio index are constant factor approximations of each other, and hence the Gittins index 
also gives a constant factor approximation to the budgeted learning problem. Finally, we show that the 
ratio index can be used to create an index-based policy that achieves an $O(1)$-approximation for the
finite horizon version of the multi-armed bandit problem. Moreover, the policy does not require any 
knowledge of the horizon (whereas we compare its performance against an optimal strategy that is aware 
of the horizon). This yields the following surprising result: there is an index-based policy that achieves 
an $O(1)$-approximation for the multi-armed bandit problem, oblivious to the underlying discount factor.

1 Introduction

The classical multi-armed bandit problem provides an elegant model to study the tradeoff between collecting 
rewards in the present based on the current state of knowledge (exploitation) versus deferring rewards to the 
future in favor of gaining more knowledge (exploration^1) [2]. Specifically, in this model a user has a choice
of bandit-arms to play, and at each time step it must decide which arm to play. The expected reward from 
playing a bandit-arm depends on the state of the bandit-arm where the state represents a "prior" belief on the 
bandit-arm. Each time a bandit-arm is played, this prior gets updated according to some transition matrix 
defined on the state space. For instance, a typical assumption on the bandit-arms is that they have
$(\alpha, \beta)$-priors: the success probability of an $(\alpha, \beta)$-bandit-arm is $\alpha/(\alpha + \beta)$; in case of a success a reward of 1 is
obtained and $\alpha$ gets incremented, whereas in case of a failure no reward is obtained and $\beta$ gets incremented.
The user wishes to maximize the total expected discounted reward over time. This simple setting effectively 
models many applications. A canonical example is exploring the effectiveness of different treatments in 
clinical trials while maximizing the benefit received by patients. 
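
The prior update described above can be sketched in a few lines of Python. This is only an illustration: the class name `BetaArmState` and the helper `play` are our own hypothetical names, and the outcome of a play is drawn from the arm's current prior.

```python
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BetaArmState:
    """State of an (alpha, beta)-bandit-arm (hypothetical helper class)."""
    alpha: int
    beta: int

    def success_probability(self) -> float:
        # Expected reward of the next play: alpha / (alpha + beta).
        return self.alpha / (self.alpha + self.beta)

    def update(self, success: bool) -> "BetaArmState":
        # A success increments alpha; a failure increments beta.
        if success:
            return BetaArmState(self.alpha + 1, self.beta)
        return BetaArmState(self.alpha, self.beta + 1)


def play(state: BetaArmState, rng: random.Random) -> Tuple[int, BetaArmState]:
    """Play the arm once, drawing the outcome from the current prior."""
    success = rng.random() < state.success_probability()
    return (1 if success else 0), state.update(success)
```

For example, `play(BetaArmState(1, 1), random.Random(0))` returns a reward in {0, 1} together with the updated state, which is either (2, 1) or (1, 2).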

The discount factor in a multi-armed bandit problem may be viewed as modulating the horizon over 
which the strategy explores to identify the bandit-arm with maximum expected reward, before switching to 
exploitation. This facet of the multi-armed bandit problem is explicitly captured by the budgeted learning 
problem, recently studied by Guha and Munagala [18]. The input to the budgeted learning problem is the
same as for the multi-armed bandit problem, except the discount factor is replaced by a horizon h. The goal 
is to identify the bandit-arm with maximum expected reward using at most h steps of exploration. The work 
of [18] gives a constant factor approximation for the budgeted learning problem via a linear programming
based approach that determines the allocation of exploration and exploitation budgets across the various 
arms. The budgeted learning problem is the main object of study in this paper. 

The multi-armed bandit problem admits an elegant solution: compute a score for each bandit-arm using 
only the current state of the bandit-arm and the discount factor, independent of all other bandit-arms in the 
system, and then play the bandit-arm with the highest score. This score is known as the Gittins index, and 
many proofs are known to show this is an optimal strategy (e.g., see [9]). The optimality of this "index- 
based" strategy implies that this problem exhibits a "separability" property whereby the optimal decision 
at each step is obtained by computations performed separately for each bandit-arm. This structural insight 
translates into efficient decision making algorithms. In fact, for commonly used prior update rules and 
discount rates, extensive collections of pre-computed Gittins indices exist, enabling in principle, a simple 
lookup-based approach for optimal decision-making. There are multiple definitions of what it means for a 
problem to have an "index". We will use the term index in its strongest form, i.e., where the index of an arm 
depends only on the state of that arm. This is also sometimes called a decomposable index (eg. EJIU). 

The inherent appeal and efficiency of index-based policies is the unifying theme underlying our work. 
We show that many interesting and non-trivial variations of the multi-armed bandit problem, including the 
budgeted learning problem and the finite horizon problem, can all be well-approximated by index-based 
policies. Moreover, our approach gives decision strategies that are oblivious to parameters such as the 
underlying horizon or the discount factor while being constant-factor competitive to optimal strategies that 
are fully aware of these parameters. 

1.1 Our results 

We will study this problem when the state space of each arm satisfies the "martingale property", i.e., if we 
play an arm multiple times, the sequence of expected rewards is a martingale. This is a natural assumption 
for multi-armed bandit and related problems, e.g. the commonly used $(\alpha, \beta)$ priors satisfy this property.

1 We will use the terms experimentation and exploration interchangeably in this paper, depending on the context.

An Index for Budgeted Learning Problems: Our first result is that the budgeted learning problem admits
an approximate index, which we call the ratio index. Informally speaking, given a single bandit-arm and 
an exploration budget of h steps, the ratio index for that arm is the maximum expected exploitation reward 
per unit of the exploration and exploitation budget utilized. The ratio index suggests the following natural 
algorithm: at each step, play the arm with the highest ratio index. We show that this simple greedy algorithm 
gives a constant factor approximation to the budgeted learning problem. An $O(1)$-approximation algorithm
for this problem is already known [18]. However, the algorithm of [18] is based on solving a coupled LP
over all the arms, whereas the ratio index can be computed for each arm in isolation, much like the Gittins 
index. The ratio index has many other interesting properties. For example: 

(1) We show that the Gittins index with discount factor $(1 - 1/h)$ and the ratio index over horizon $h$ are
within a constant factor of each other. This gives the following surprising result:

Theorem 1.1 Given an exploration budget $h$, playing at each step the arm with the highest Gittins index,
with discount factor $1 - 1/h$, yields a constant factor approximation to the budgeted learning problem.

The proof relies on comparing the "decision-trees" of the ratio index and Gittins index strategies. Even in 
retrospect, it is not clear to us how such a result could be derived using an LP-based formulation such as the 
one used by Guha and Munagala [18]. Interestingly, the policy described in Theorem 1.1 is known to often
work well in practice [23]. Nonetheless, before the work of Guha and Munagala [18], we do not know of
any provable guarantees for polynomial time algorithms in this setting. And until now, we don't know of 
any formal guarantees that relate the exponential discounting approach (which yields the Gittins index) and 
the budgeted learning approach. 

(2) The ratio index can be computed in time which is strongly polynomial in the size of the state space 
(independent of h) of each arm if the state space is acyclic, and strongly polynomial in the size of the state 
space and h if the state space is general. Our proof of this fact involves recursively analyzing the basic 
feasible solutions of an underlying LP for computing optimum single arm strategies and using the structure 
of the basic feasible solutions to prove that these strategies have a simple form. 

Finite Horizon and Discount-Oblivious Multi-Armed Bandits: We next study an important and natural 
variation of the budgeted learning problem, called the finite horizon multi-armed bandit problem. We are 
given a finite horizon h, and the goal is to maximize the expected reward collected during the horizon. 
Thus, in contrast to the budgeted learning problem, the horizon h is being used for both exploration and 
exploitation, and no payoffs are obtained after time h. We show the following result using the ratio index: 

Theorem 1.2 There is an index-based policy that gives a constant factor approximation to the finite horizon 
multi-armed bandit problem. 

Finally, we study the role of the discount factor in the design of an optimal strategy for the exploration- 
exploitation tradeoff. Small variations in discount factors can alter the choice of bandit-arm played at any 
step, highlighting the sensitivity of the Gittins index to the discount rate. We study the "Discount-oblivious" 
multi-armed bandit problem where the underlying discount factor is not known, and in fact, may even vary 
from one time step to the next. A finite horizon problem can be viewed as a special case of this general 
setting where the discount factor is 1 for the first $h$ steps and is 0 for all subsequent steps. There is a
useful relationship between the finite horizon and discount oblivious versions of the multi-armed bandit problem:
a strategy is $\kappa$-approximate for the discount-oblivious multi-armed bandit problem iff it is $\kappa$-approximate
(simultaneously) for all finite horizons. Using this connection, and building on Theorem 1.2, we show the
following result: 

Theorem 1.3 There is an index-based policy that gives a constant factor approximation for the multi-armed
bandit problem with respect to all possible discount factors simultaneously. 

Our proof of both of these results is based on the following easy consequence of the ratio index approach to 
the budgeted learning problem. For any constant $\beta$, the expected profit of the optimal $h/\beta$-horizon strategy
is an $\Omega(1)$-fraction of the expected profit of an optimal $h$-horizon strategy. Using this result, we design
an algorithm that alternates between budgeted exploration and exploitation, using geometrically increasing 
horizons; each increasing horizon competing against a lower discount rate on future rewards. It is worth 
noting that this result can also be shown using the LP-based proof of Guha and Munagala. However, the 
following corollary is a consequence of our index-based approach and the relation between ratio and Gittins 
indices. 

Corollary 1.4 The strategy that alternates between exploring the arm with the highest Gittins index, and 
exploiting the arm with the highest reward, in phases of geometrically increasing length (and discount
factor $1 - 1/t$ during a phase of length $t$) provides a constant factor approximation to the multi-armed
bandit problem simultaneously for all finite horizons and for all discount factors. 

1.2 Related Work and Organization 

There are many sources for the canonical work on Gittins indices, particularly with reference to $(\alpha, \beta)$
bandits and Bernoulli bandit processes [10] [11] [12] [9]. Glazebrook and others have studied approximation
algorithms for other extensions to multi-armed bandit problems [13] [14]. Their approach builds upon the
concept of achievable regions and general conservation laws and a related linear programming approach built 
by Tsoucas, Bertsimas, Nino-Mora, and others HJ|3]|25]]. Relaxed linear programming based approaches 
to extensions of the multi-armed bandit problem have also been developed, e.g. for restless bandits [27] [4].
Our work on the ratio index builds on the insights obtained from the LP relaxation based approach of 
Guha and Munagala |[T8l as well as related work in model-driven optimization lPT5l [171 and stochastic 
packing OH] 02 COS. Additionally, related LP formulations have been developed for multi-stage stochastic 
optimization [6] [24].

In the theoretical computer science community, multi-armed bandits have primarily been studied in an 
adversarial setting, with the goal being to minimize the regret (see for a nice overview). A typical 
guarantee in these settings is that the total regret after $T$ steps grows as $O(\sqrt{TN})$ where $N$ is the number of
alternatives, assuming the partial information model (i.e. only the reward for the alternative that is actually 
played is revealed), which corresponds well to our setting. These results assume no prior beliefs, unlike our 
decision theoretic framework. However, the regret based bounds in the adversarial setting are meaningless 
unless T > N. The decision theoretic framework which has a rich history (starting perhaps with Wald's 
work in 1947 [26]) is more suited to the situation where the number of exploration steps is drastically limited,
as is often the case. A typical setting, for example, is one where an advertiser can advertise on 100,000
possible phrases and is willing to pay for 100 clicks to decide which keyword attracts visitors that convert 
into paid customers. So a traditional regret based bound may not be very meaningful in this setting. 

In Section 2 we define the budgeted learning problem and the ratio index, and prove that the ratio index
is a constant factor approximation to the budgeted learning problem. Section 3 establishes that the Gittins
and ratio indices are constant factor approximations of each other. We also show there that playing the arm
with the largest Gittins index (with a suitable discount factor) gives a constant factor approximation to the
budgeted learning problem. Section 4 presents index-based policies for finite horizon and discount oblivious
versions of the multi-armed bandit problem. In Section 5 we present a strongly polynomial algorithm to
compute the ratio index as well as several useful insights into its structural properties.

2 The Budgeted Learning Problem and the Ratio Index

2.1 The Budgeted Learning Problem. 

We are given $n$ arms. Arm $i$ has state space $T_i$, with initial state $\rho_i$. Experimenting on an arm $i$ in state
$u \in T_i$ results in the arm entering state $v \in T_i$ with known probability $P_{uv}$. The payoff of state $u$ is given
as $\zeta(u)$. Given an experimentation budget $h$, we are interested in finding the optimal policy, $\pi^*$, so that
$E_{\pi^*}[\max_{i \in \{1, \ldots, n\}} \zeta(v_i)]$ is maximum among all policies, where $v_i$ is the state of arm $i$ after the policy has
been executed (the number of experiments cannot exceed $h$).

We will use $T$ to denote $\bigcup_i T_i$. For convenience, we will assume that the $T_i$ are disjoint and that $P_{uv} = 0$
if $u$ and $v$ are in the state spaces of different bandit-arms; this can be easily enforced by duplicating any
shared states. The initial states represent a prior belief on the payoff from the bandit-arms. We will assume
that the expected payoff is a martingale, i.e., $\zeta(u) = \sum_{v \in T} P_{uv}\, \zeta(v)$; the martingale assumption is crucial
to our results. We will also assume without loss of generality that the state space of any arm is acyclic and
truncated at depth $h$.
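
As a quick sanity check of the martingale assumption, the following sketch verifies it exactly (using rational arithmetic) for the commonly used $(\alpha, \beta)$ priors, on the state space truncated at a given depth. The function names and the representation of a state as a pair `(a, b)` are our own illustrative choices.

```python
from fractions import Fraction


def payoff(state):
    """zeta(u) for a Beta state u = (a, b): the posterior mean a / (a + b)."""
    a, b = state
    return Fraction(a, a + b)


def transitions(state):
    """P_uv for a Beta state: success moves to (a+1, b), failure to (a, b+1)."""
    a, b = state
    p = Fraction(a, a + b)
    return {(a + 1, b): p, (a, b + 1): 1 - p}


def is_martingale(root, depth):
    """Check zeta(u) == sum_v P_uv * zeta(v) for all states within `depth` plays."""
    frontier = {root}
    for _ in range(depth):
        next_frontier = set()
        for u in frontier:
            succ = transitions(u)
            if payoff(u) != sum(p * payoff(v) for v, p in succ.items()):
                return False
            next_frontier.update(succ)  # adds the successor states (dict keys)
        frontier = next_frontier
    return True
```

The check succeeds because $\frac{a}{a+b} \cdot \frac{a+1}{a+b+1} + \frac{b}{a+b} \cdot \frac{a}{a+b+1} = \frac{a}{a+b}$.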

The martingale property has some useful and easy consequences which we will use repeatedly: 

1. For an arbitrary policy let p(t) denote its expected payoff if it is terminated after t experiment steps. 
Then, p(t) is non-decreasing in t. In other words, extra experiments can never hurt. 

2. Given a single arm, no policy can have a higher expected payoff than the one which does no explo- 
ration and simply chooses the initial state as the winner; in other words, extra experiments can never 
help given just one arm. 

The proof of the following theorem is deferred to Appendix A. It is conceivable (though not obvious to us)
that this theorem can also be obtained via the "indexability characterization" of 0. In any case, the proof 
is quite elementary and provides useful intuition. 

Theorem 2.1 There is no exact index for the budgeted learning problem. 

2.2 The ratio index 

We will now define the ratio index, which is an approximate index for this problem. At any given time, the
current state of the system is denoted by $S = \{u_1, u_2, \ldots, u_n, \delta\}$, which captures the current states of all
the arms, and the budget left (i.e. the number of experimentation steps that are still remaining), $\delta$. The initial
state of the system has all the arms in their initial states, and $\delta = h$. Since we use the term state for both
the system and an arm, we will disambiguate where necessary by referring to these as "system-state" and
"arm-state" respectively. A policy $\pi$ is a function which takes as input a system-state $S$ and either returns
an arm $i$ for experimentation (i.e. explores the arm-state $u_i$), or chooses an arm $i$ as a winner and terminates
(i.e. exploits the arm-state $u_i$), or simply terminates (abandons). If $\delta = 0$ then the only options are to
abandon or exploit. The martingale property (see the comment at the end of section 2.1) implies that there
always exists an optimal policy which explores some arm iff $\delta > 0$ and exploits some arm iff $\delta = 0$. We
now introduce two vectors $x^\pi$ and $z^\pi$. The probability that arm-state $u$ is the final exploited state by policy
$\pi$ is given by $x^\pi_u$. The probability that arm-state $u$ is explored by policy $\pi$ is given by $z^\pi_u$. We define the cost
of policy $\pi$ as

$$C(\pi) = \frac{1}{h} \sum_{u \in T} z^\pi_u + \sum_{u \in T} x^\pi_u.$$

Observe that $C(\pi) \le 2$ for any policy $\pi$, since a policy explores at most $h$ times and exploits at most once.
The profit of policy $\pi$ is defined as

$$\mathcal{P}(\pi) = \sum_{u \in T} x^\pi_u\, \zeta(u).$$

Observe that our definition of policy is an adaptive one; the decisions made in step $j > 1$ depend on the
entire system-state at time $j$ and hence on the outcome of previous experimentation steps. Further, it is easy
to see that randomized strategies cannot do any better than deterministic strategies.

If we drop the requirement that a policy must either exploit or abandon when the remaining budget $\delta$ is
0, we obtain what we call a pseudo-policy. A single arm policy is one which makes all its decisions based 
only on the state of a single pre-determined arm i, ignoring all other arms. We are now ready to define the 
ratio index and prove that it leads quite naturally to an approximation of the budgeted learning problem. 

Ratio Index. The ratio index $r(u, h)$ of a bandit-arm (say arm $i$) in initial state $u$ and with experimentation
budget $h$, is defined as

$$r(u, h) = \max_{\pi} \frac{\mathcal{P}(\pi)}{C(\pi)}$$

where the max is over all single arm pseudo-policies $\pi$ which have initial arm-state $u$, budget $h$, state space
$T_i$, and cost $C(\pi) > 0$. We refer to a policy which yields the ratio index as a ratio index policy for state $u$,
denoted $\pi^r(u, h)$.
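
To make the definition concrete, here is a minimal numerical sketch for an $(\alpha, \beta)$ arm. It is not the strongly polynomial algorithm of Section 5: it simply bisects on a candidate ratio `lam` and uses a dynamic program to test whether some single arm pseudo-policy with positive cost achieves profit minus `lam` times cost of at least zero. It assumes exploitation contributes 1 to the cost and each exploration contributes $1/h$, with the state space truncated at depth $h$ (and $h \ge 1$); the function name `ratio_index` and the tolerance parameter are our own.

```python
from functools import lru_cache


def ratio_index(a, b, h, tol=1e-9):
    """Approximate r(u, h) for a Beta arm in state u = (a, b) by bisection."""
    root_depth = a + b

    def best_nontrivial_value(lam):
        # max over policies that act at the root of P(pi) - lam * C(pi).
        @lru_cache(maxsize=None)
        def value(x, y):
            zeta = x / (x + y)
            exploit = zeta - lam               # exploitation adds 1 to the cost
            if x + y - root_depth >= h:        # state space truncated at depth h
                return max(0.0, exploit)
            explore = -lam / h + zeta * value(x + 1, y) + (1 - zeta) * value(x, y + 1)
            return max(0.0, exploit, explore)  # 0.0 corresponds to abandoning

        zeta = a / (a + b)
        explore = -lam / h + zeta * value(a + 1, b) + (1 - zeta) * value(a, b + 1)
        return max(zeta - lam, explore)        # abandoning at the root has cost 0

    # Exploiting at once already achieves ratio zeta(u); payoffs never exceed 1.
    lo, hi = a / (a + b), 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if best_nontrivial_value(mid) >= 0:
            lo = mid
        else:
            hi = mid
    return lo
```

For the state (1, 1) with $h = 1$, the best option is to exploit immediately, so the sketch returns 0.5. With $h = 10$, exploring once (cost 0.1) and exploiting only after a success already gives ratio $(1/3)/0.6 \approx 0.556$, so the index is larger.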

Even though we allow pseudo-policies in the definition of the ratio index, any ratio index policy respects 
the budget constraint: 

Lemma 2.2 Any ratio index policy for state u has cost at most 1. 

Proof: Because of the martingale property, no single arm policy starting from arm-state $u$ can obtain profit
more than $\zeta(u)$. Hence, any single arm policy $\pi$ that has cost more than 1 must have a smaller ratio (of
expected profit to expected cost) than the single arm policy which exploits in state $u$, whose ratio is $\zeta(u)$. ■

Greedy Algorithm. Suppose the initial experimentation budget is $h$, and the current system-state is given
by $S = \{u_1, u_2, \ldots, u_n, \delta\}$. If $\delta > 0$, the greedy algorithm explores the arm $i$ with the maximum ratio index,
$r(u_i, h)$, with ties broken arbitrarily but consistently. If $\delta = 0$, the greedy algorithm exploits the arm $i$ with
maximum current expected reward $\zeta(u_i)$. We denote the greedy algorithm by $G$.

Note: The greedy algorithm uses the same $h$ at every step to compute the ratio index. Hence, given a table
of the ratio index of every state in $T$ (which can be pre-computed efficiently as specified in Section 5),
we can implement this algorithm using a simple heap and the complexity of each step would be just
$O(\log n)$, which is much better than solving a coupled LP with $3nh$ variables.
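
The heap-based implementation mentioned in the note can be sketched as follows. This is a simulation under stated assumptions: arms carry $(\alpha, \beta)$ states, outcomes are drawn from the current prior, and `index` is a caller-supplied function standing in for a lookup into the pre-computed ratio index table (a hypothetical interface). Ties are broken by arm number.

```python
import heapq
import random


def greedy_budgeted_learning(states, index, h, rng):
    """Explore for h steps by highest index, then exploit the best posterior mean.

    states: list of (a, b) Beta states, one per arm.
    index:  callable mapping a state (a, b) to a number; a stand-in for a
            pre-computed table of r(u, h) (hypothetical interface).
    """
    states = list(states)
    # heapq is a min-heap, so store negated indices to pop the largest first.
    heap = [(-index(s), i) for i, s in enumerate(states)]
    heapq.heapify(heap)
    for _ in range(h):
        _, i = heapq.heappop(heap)
        a, b = states[i]
        # Simulate one experiment, drawing the outcome from the current prior.
        states[i] = (a + 1, b) if rng.random() < a / (a + b) else (a, b + 1)
        # Only arm i changed state, so a single push restores the heap.
        heapq.heappush(heap, (-index(states[i]), i))
    winner = max(range(len(states)), key=lambda i: states[i][0] / sum(states[i]))
    return winner, states
```

Because only the explored arm changes state, each step costs one pop and one push, which matches the $O(\log n)$ per-step bound.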

2.3 Analysis of the greedy algorithm 

We now show that the greedy algorithm gives an $O(1)$-approximation to the budgeted learning problem.

Lemma 2.3 A ratio index policy for arm-state $u$, $\pi^r(u, h)$, does not abandon any arm-state $v$ with
$r(v, h) > r(u, h)$ and does not explore or exploit any arm-state $v$ with $r(v, h) < r(u, h)$.

The proof of the above lemma is deferred to Appendix D (as Corollary D.6). Now consider the following
algorithm, which we call the persistent algorithm, denoted $G'$:

The persistent algorithm $G'$: Given a system-state $S$, let $i$ be the arm with the highest ratio index $r(u_i, h)$
where $u_i$ denotes the current state of arm $i$. Play arm $i$ in accordance with the policy $\pi^r(u_i, h)$ until
the policy chooses to exploit or abandon. If $\pi^r(u_i, h)$ abandons, let $S'$ be the resulting system-state.
Repeat the process starting with $S'$. If at any time, the system-state is such that $\delta = 0$, immediately
exploit the arm that has the highest current ratio index.

Observe that as for the greedy algorithm, the ratio index used by the persistent algorithm $G'$ is computed
using a fixed budget $h$; the number of remaining exploration steps $\delta$ is used only to terminate $G'$.

Lemma 2.4 The expected profit of the greedy algorithm $G$ is at least as much as the expected profit of the
persistent algorithm $G'$.

Proof: Couple the greedy and the persistent algorithms such that if they both explore an arm in a given state,
that arm transitions to the same state for both strategies. Let $I = (i_1, i_2, \ldots, i_k)$ denote the sequence of arms
explored by $G'$ before exploitation; here $k \le h$ since $G'$ can exploit early. Let $J = (j_1, j_2, \ldots, j_h)$ denote
the sequence of arms explored by $G$. By Lemma 2.3 we can conclude that $I$ is a prefix of $J$. By the martingale
property, early termination can never result in increased profit; the lemma follows. ■

Thus it suffices to analyze $G'$. Given two single arm pseudo-policies $\pi$ and $\tilde{\pi}$ for an arm $i$, we say that
$\tilde{\pi} \succeq \pi$ if for all arm-states $u \in T_i$ for which $\pi$ explores (exploits) arm $i$, $\tilde{\pi}$ also explores (exploits) the arm $i$.
Notice that $\tilde{\pi}$ might choose to continue exploration/exploitation when $\pi$ abandons an arm-state. Informally,
$\tilde{\pi} \succeq \pi$ means that policy $\tilde{\pi}$ can be played after policy $\pi$ has been played to completion. We will now state a
useful technical lemma; the proof is in Appendix A.

Lemma 2.5 Given two arbitrary single arm pseudo-policies $\pi$, $\pi'$ for arm $i$ in initial arm-state $u$, there
exists another single arm pseudo-policy $\tilde{\pi}$ starting in the same initial arm-state $u$ such that (1) $\tilde{\pi} \succeq \pi$,
(2) $C(\tilde{\pi}) - C(\pi) \le C(\pi')$, and (3) $\mathcal{P}(\tilde{\pi}) \ge \mathcal{P}(\pi')$.

The above property is akin to submodularity. We now state our main theorem, which says that the greedy
algorithm $G$ gives a constant factor approximation to the optimal policy. Let $B_\pi(h, S)$ denote the expected
profit obtained by strategy $\pi$ for the budgeted learning problem run with budget $h$ and initial system-state
$S$, and let $B^*(h, S)$ denote the expected profit obtained by an optimum strategy with the same parameters.
We will omit the system-state when it is the same for all the strategies involved.

Theorem 2.6 $B_G(h) > 0.22\, B^*(h)$.



Proof: From Lemma 2.4 it suffices to analyze the persistent algorithm $G'$ rather than the greedy algorithm
$G$. We divide the persistent algorithm into stages, starting from stage 1. Let $i_1$ be the arm with the highest
ratio index at the beginning of stage 1 (and hence the arm that will be played by $G'$ at the first step). Since
the arms evolve probabilistically, the first stage (as well as subsequent stages) will result in a distribution
over system-states. Let $S_j$ denote the system-state at the start of stage $j$, and let $\mathcal{D}_j$ denote the distribution
of $S_j$. Let $u_j$ be the arm-state with the highest ratio index among the arm-states which have a non-zero
probability, say $\gamma_j$, in $\mathcal{D}_j$, and let $i_j$ be the corresponding arm. The $j$-th stage of $G'$ is to simply move to
the next stage if the arm $i_j$ is not in state $u_j$ (which happens with probability $1 - \gamma_j$); we call this stage
"empty" in this case. If the arm $i_j$ is in state $u_j$, then the $j$-th stage of $G'$ is to mimic an optimum ratio index
policy for state $u_j$. If the exploration budget gets exhausted during this mimicking process, then the $j$-th
stage exploits arm $i_j$ right away and the policy terminates; the cost of the extra exploitation is not charged
to this stage of the policy. By the martingale property, this early termination can only increase the expected
profit of the $j$-th stage. If the $j$-th stage exploits an arm, then the persistent algorithm terminates as well.

Let $\pi_j$ denote the policy corresponding to the $j$-th stage. Let $p_j$ and $c_j$ be the cumulative expected profit
and expected cost of the first $j$ stages. Use $\Delta_p(j)$ and $\Delta_c(j)$ to denote the expected profit and the expected
cost of the $j$-th stage, conditioned on this stage being played (i.e. the $j$-th stage being non-empty and the
persistent algorithm not terminating before reaching the $j$-th stage). The following statement is a corollary
of Lemma 2.5; the proof is a digression from the current theorem and is deferred to Appendix A.

Corollary 2.7 At the beginning of stage $j$, there exists a single arm pseudo-policy with profit to cost ratio
at least $(\mathcal{P}(\pi^*) - p_{j-1})/2$.

Since the persistent algorithm follows an optimum ratio index policy we are guaranteed that
$\Delta_p(j)/\Delta_c(j) \ge (\mathcal{P}(\pi^*) - p_{j-1})/2$. By Markov's inequality, the probability that the budget has not been exhausted before
stage $j$ starts is at least $1 - c_{j-1}$. Also, recall that $\gamma_j$ is the probability that the $j$-th stage is non-empty. The
expected unconditioned profit of the $j$-th stage is at least $\gamma_j (1 - c_{j-1}) \Delta_p(j)$. The expected unconditioned
cost of the $j$-th stage is at most $\gamma_j \Delta_c(j)$. Hence, we get

$$\frac{p_j - p_{j-1}}{c_j - c_{j-1}} \ge \frac{\mathcal{P}(\pi^*) - p_{j-1}}{2}\, (1 - c_{j-1}).$$

Thus, the profit obtained by the persistent algorithm is more than the one attained by the following differential
process, where $p$ is the cumulative profit and $c$ is the cumulative cost, and $p^* = \mathcal{P}(\pi^*)$ (view the process
as increasing the expected cost from 0 to 1):

$$\frac{dp}{dc} = \frac{p^* - p}{2}\, (1 - c).$$

Integrating from $c = 0$ to $c = 1$, we get that the expected profit is at least $(1 - e^{-0.25})\, p^* > 0.22\, p^*$. ■
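
For completeness, the integration step can be spelled out by separation of variables. The only assumption added here is the initial condition $p(0) = 0$, i.e. no profit has been accumulated at zero cost; $p(1)$ denotes the cumulative profit of the differential process at cost 1.

```latex
\begin{align*}
\frac{dp}{dc} = \frac{p^* - p}{2}\,(1 - c)
  &\;\Longrightarrow\; \int_0^{p(1)} \frac{dp}{p^* - p} = \int_0^1 \frac{1 - c}{2}\, dc \\
  &\;\Longrightarrow\; \ln\frac{p^*}{p^* - p(1)} = \frac{1}{2}\left[c - \frac{c^2}{2}\right]_0^1 = \frac{1}{4} \\
  &\;\Longrightarrow\; p(1) = \left(1 - e^{-1/4}\right) p^* \approx 0.2212\, p^*.
\end{align*}
```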
Thus, we have shown the existence of a simple index which yields almost as good an approximation ratio as 
the LP-based approach of Guha and Munagala. The results above assume that each exploration step has the 
same cost, but can easily be extended to the weighted exploration cost case. We can also modify the proof 
slightly to obtain the following corollary.^2

Corollary 2.8 B*(h/2) > 0.17 B*(h). 

Combining Corollary 2.8 and Theorem 2.6, we obtain the following corollary:

Corollary 2.9 $B_G(h/2) = \Omega(B^*(h))$.

3 Relating the Gittins and Ratio indices 

We will use $S_\delta$ to denote a standard (stationary) bandit-arm with a fixed reward of $\delta$. We will use $A$ to
denote a given bandit-arm in some initial state $u$. A Gittins index strategy $S$ takes as input an arm $A$ with
an unknown reward distribution (but a known initial state) and a standard bandit-arm $S_\delta$ for some $\delta > 0$,
and gives a strategy for maximizing the discounted reward for a multi-armed bandit with $A$ and $S_\delta$ as its
two bandit-arms. Thus each node in the decision tree of $S$ is labeled as playing either the given arm $A$ or
the standard bandit $S_\delta$. We can assume w.l.o.g. that once the strategy $S$ plays the standard bandit at a node
in the tree, it plays it forever from there onwards. The Gittins index of an arm $A$ is defined to be the least
$\delta$ such that the Gittins index strategy with input arms $A$ and $S_\delta$ is indifferent between playing either one of
them at time 0. We will assume $u$ to be the initial state of $A$ in the remainder of this section, and drop its
explicit mention. Let $r(h)$ denote the ratio index for $A$ when the horizon is limited to $h$. Let $\rho(\theta)$ denote the
Gittins index for $A$ when the discount factor is uniformly $\theta$, for some $0 < \theta < 1$. The following lemmas show
that the Gittins and the ratio indices are constant factor approximations of each other. The proofs involve
transforming the Gittins index strategy to the ratio index strategy (and vice versa) and are in Appendix B.

Lemma 3.1 For any $h > 2$, $\rho(\theta) \ge r(h) \left(1 - \frac{1}{h}\right)^h$ where $\theta = 1 - \frac{1}{h}$. Thus as $h \to \infty$, $\rho(\theta) \ge r(h)/e$.

Lemma 3.2 For any $h > 2$, $\rho(\theta) \le (2 + 4e)\, r(h)$ where $\theta = 1 - \frac{1}{h}$.

2 This corollary can also be obtained using the LP-based framework of Guha and Munagala [18].

Lemma 3.3 Let $\rho_i(t)$ denote the Gittins index of arm $i$ at time $t$, where the discount factor $\theta$ is $1 - 1/h$.
Playing the arm with the highest value of $\rho_i(t)$ for $t = 1, 2, \ldots, h$ and then picking the arm with the highest
expected payoff at time $t$ results in a constant factor approximation to the budgeted learning problem.

While the constants in our proofs are large, the algorithms are simple and intuitive. For instance, Schneider
and Moore [23] and Madani, Lizotte and Greiner [21] have studied the policy defined in Lemma 3.3 and
other similar policies, and found that they often work well in practice.
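
The policy of Lemma 3.3 needs the Gittins index with discount factor $\theta = 1 - 1/h$. One standard way to approximate it numerically, sketched below for an $(\alpha, \beta)$ arm, follows the definition above: bisect on the standard arm's reward $\delta$ until playing the given arm and playing the standard arm are equally attractive at time 0. The sketch truncates the decision tree at a fixed `depth` and commits to the better arm forever at the leaves, which is an approximation; the function name and parameters are our own.

```python
def gittins_index(a, b, theta, depth=100, tol=1e-9):
    """Approximate the Gittins index of a Beta(a, b) arm for discount factor theta."""

    def advantage(delta):
        # Value of playing the arm first, minus the value of retiring to the
        # standard arm (reward delta forever) right away.
        retire = delta / (1 - theta)
        n = a + b + depth
        # Leaves: s successes out of `depth` plays; commit to the better arm.
        values = [max(delta, (a + s) / n) / (1 - theta) for s in range(depth + 1)]
        for d in range(depth - 1, 0, -1):
            n = a + b + d
            nxt = []
            for s in range(d + 1):
                p = (a + s) / n
                play = p + theta * (p * values[s + 1] + (1 - p) * values[s])
                nxt.append(max(retire, play))
            values = nxt
        p = a / (a + b)
        play_root = p + theta * (p * values[1] + (1 - p) * values[0])
        return play_root - retire

    lo, hi = a / (a + b), 1.0   # the index lies between the prior mean and 1
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if advantage(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

For a budget $h$, one would call `gittins_index(a, b, 1 - 1/h)`; the truncation depth should be large compared to $1/(1 - \theta)$ for the approximation to be reasonable.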

4 Finite Horizon and Discount Oblivious Multi-armed Bandits 

In the traditional multi-armed bandit problem, we are given a fixed discount factor $\theta \in (0, 1)$ and allowed
to play one arm at each time. If the reward at time $t$ is $r(t)$ then the total discounted reward is $\sum_{t \ge 0} \theta^t r(t)$.
Always playing the arm with the currently highest Gittins index maximizes the expected total discounted
reward; however the Gittins index of an arm depends crucially on the parameter $\theta$. In this section, we discuss
both finite horizon and discount oblivious versions of the multi-armed bandit problem. 

In the finite horizon multi-armed bandit problem, we are given a fixed number of steps, h, as in the 
budgeted learning problem. However, unlike the budgeted learning problem, the objective of the finite 
horizon problem is to maximize the total (undiscounted) expected reward obtained during the first h steps. 
This models many important problems such as optimally placing bets with a fixed number of chips, and 
optimally assigning impressions to advertisers [22].

In the discount oblivious multi-armed bandit problem, we want to find a strategy that provides a constant
factor approximation to the optimum reward for all $\theta \in (0,1)$ simultaneously. It is not clear up front that
such a strategy exists. In fact, we will allow the discounts to be even more general. Let $\Lambda = (\lambda_0, \lambda_1, \lambda_2, \ldots)$
be an infinite sequence of discount factors that satisfies the property $1 = \lambda_0 \ge \lambda_1 \ge \lambda_2 \ge \ldots$ and where
$\lambda_t \to 0$ as $t \to \infty$. We will call such a sequence a discount factor sequence. Let the system-state $S$ denote
the vector of all the arm-states. We will use $D_\pi(\Lambda, S)$ to denote the total expected discounted reward of
any strategy $\pi$ for discount factor sequence $\Lambda$ starting from system-state $S$. If strategy $\pi$ obtains reward $r(t)$
at time $t$ when started in initial system-state $S$, then $D_\pi(\Lambda, S) = \sum_{t \ge 0} \lambda_t r(t)$. Setting $\lambda_t = \theta^t$ leads to
the standard multi-armed bandit problem. Setting $\lambda_t = 1$ for $t < h$ and $\lambda_t = 0$ otherwise leads to a fixed
horizon problem where we only get the reward from the first $h$ time steps.
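As a small illustration of how the two special cases fit the same definition, the following sketch evaluates $\sum_t \lambda_t r(t)$ for a short reward list. The function names and the reward values are made up for this example and do not come from the paper.

```python
def geometric(theta):
    """Yield the discount factor sequence lambda_t = theta**t."""
    lam = 1.0
    while True:
        yield lam
        lam *= theta


def fixed_horizon(h):
    """Yield lambda_t = 1 for t < h and lambda_t = 0 afterwards."""
    t = 0
    while True:
        yield 1.0 if t < h else 0.0
        t += 1


def discounted_reward(rewards, lambdas):
    """Return sum_t lambda_t * r(t) over a finite list of rewards."""
    return sum(lam * r for lam, r in zip(lambdas, rewards))


# Hypothetical per-step rewards r(0), ..., r(3).
rewards = [1.0, 0.5, 0.25, 2.0]
window_total = discounted_reward(rewards, fixed_horizon(2))   # r(0) + r(1)
geometric_total = discounted_reward(rewards, geometric(0.5))  # theta = 1/2
```

Both sequences start at 1 and are non-increasing, so they satisfy the definition above on the steps that are evaluated.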

We will use $F_\pi(h, S)$ to denote the total (undiscounted) expected reward over a window of $h$ steps of
any strategy $\pi$, starting from $S$. We will use $D^*(\Lambda, S)$, $F^*(h, S)$ to denote the optimum values for the two
problems. We will omit the parameter $S$ when it is the same for all strategies under discussion. All proofs
are deferred to appendix C.

4.1 An approximate index for the finite horizon problem 

Recall (from section 2) that $B_G(h)$ and $B^*(h)$ denote the expected profit of the greedy algorithm (which
always explores the arm with the largest ratio index) and the optimum strategy respectively, for the budgeted
multi-armed bandit problem. We first relate the budgeted learning and finite horizon problems:

Lemma 4.1 For any positive integer $h$, we have $\lceil h/2 \rceil \cdot B^*(\lfloor h/2 \rfloor) \le F^*(h) \le h \cdot B^*(h)$.

We will now define two index-based strategies for the finite horizon problem, assuming horizon $h$ and initial
system-state $S$:

1. For the first $\lfloor h/2 \rfloor$ steps, play the arm with the highest ratio index, where the ratio index is computed
assuming a budget of $\lfloor h/2 \rfloor$. For the remaining $\lceil h/2 \rceil$ steps, play the arm with the highest expected
reward. We will denote this strategy as RatioSwitch($h, S$) since it switches from using the ratio
index (in the first half) to using the expected profit as an index in the second half.

2. Similarly, define GittinsSwitch($h, S$) as the strategy which plays the arm with the highest Gittins
index (assuming a discount factor $1 - 1/\lfloor h/2 \rfloor$) during the first $\lfloor h/2 \rfloor$ steps, and then switches to
using the arm with the highest expected reward.

Observe that both strategies use an index at each step, and the choice of index does not depend on the state
of the system; it only depends on the time step. As before, we will omit the system-state parameter $S$ when
it is the same for all strategies under discussion.
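The control flow shared by the two strategies can be sketched as follows. This is only an illustration under stated assumptions: `index_fn`, `mean_fn` and `pull_fn` are hypothetical callables supplied by the caller (for instance a ratio index or Gittins index routine), and the toy arms below have fixed, made-up values rather than evolving posterior states.

```python
def uses_index(step, h):
    """True while a Switch strategy is in its first floor(h/2) steps (0-based step)."""
    return step < h // 2


def play_switch(h, arms, index_fn, mean_fn, pull_fn):
    """Run a Switch-style strategy for h steps and return the total reward.

    index_fn(arm, budget), mean_fn(arm) and pull_fn(arm) are hypothetical
    callables; they stand in for quantities the paper defines abstractly.
    """
    total = 0.0
    for step in range(h):
        if uses_index(step, h):
            # First half: highest index, computed for a budget of floor(h/2).
            arm = max(arms, key=lambda a: index_fn(a, h // 2))
        else:
            # Second half: highest current expected reward.
            arm = max(arms, key=mean_fn)
        total += pull_fn(arm)
    return total


# Toy run with fixed (made-up) index and mean values.
toy_index = {"a": 1.0, "b": 2.0}
toy_mean = {"a": 0.9, "b": 0.1}
pulled = []


def toy_pull(arm):
    pulled.append(arm)
    return toy_mean[arm]


total = play_switch(5, ["a", "b"], lambda a, budget: toy_index[a], toy_mean.get, toy_pull)
```

The choice between the two branches depends only on `step` and `h`, which mirrors the observation that the choice of index depends only on the time step.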



To the best of our knowledge, this is the first index for the finite horizon problem with provable approximation
guarantees. It would be interesting to obtain a smooth version of GittinsSwitch($h$) which does not
need to make the discrete jump from a discount factor of $1 - 2/h$ in the first half to playing the arm with the
highest expected reward (i.e. to a discount factor of 0) in the second half.

4.2 An approximate index for the discount oblivious problem 

We will first establish a connection between the discount oblivious and finite horizon problems and then use 
this connection to obtain a simple index-based approximation algorithm for the discount oblivious problem. 

Lemma 4.4 For any $\kappa$, a strategy gives a $\kappa$-approximation simultaneously for all discount factor sequences
$\Lambda$ iff it gives a $\kappa$-approximation simultaneously to the fixed horizon problems with all horizons $h > 0$.

Let RatioScale be the following discount oblivious strategy: play in sequence the strategies RatioSwitch($1, S_0$),
RatioSwitch($2, S_1$), RatioSwitch($4, S_3$), RatioSwitch($8, S_7$), ..., where each RatioSwitch($2^k$)
is started from the state of the system after time $2^k - 1$, denoted $S_{2^k - 1}$; this is the state in which the arms
are left by the previous RatioSwitch strategy. $S_0$ is the initial state of the system.

Like RatioSwitch, RatioScale is also an index-based strategy; the index used at any time step $t$ depends
only on $t$. Analogously, GittinsScale plays the sequence GittinsSwitch($1, S_0$), GittinsSwitch($2, S_1$),
GittinsSwitch($4, S_3$), GittinsSwitch($8, S_7$), ...
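The doubling schedule can be made concrete with a short sketch. It assumes time steps are numbered from 1, so that the phase with horizon $2^k$ occupies steps $2^k, \ldots, 2^{k+1} - 1$; the function names are illustrative only.

```python
def scale_phase(t):
    """Map a 1-based time step t to (phase horizon 2**k, 0-based offset in the phase)."""
    if t < 1:
        raise ValueError("time steps are numbered from 1")
    horizon = 1 << (t.bit_length() - 1)  # largest power of two that is <= t
    return horizon, t - horizon


def scale_index_choice(t):
    """Return which index the Scale strategy consults at time step t."""
    horizon, offset = scale_phase(t)
    # Inside a Switch phase: index for floor(horizon/2) steps, then expected reward.
    return "index" if offset < horizon // 2 else "expected_reward"


schedule = [(t, *scale_phase(t), scale_index_choice(t)) for t in range(1, 8)]
```

Because `scale_index_choice` takes only `t` as input, the sketch reflects the statement that the index used at a time step depends only on that time step.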

Since the state of the system at the start of RatioSwitch($2^i$) depends on the outcomes of the previous
steps, the following technical lemma, which is an easy consequence of the martingale property, will be
useful. This lemma states that performing an arbitrary sequence of extra explorations at the beginning
cannot hurt the optimum solution for the budgeted learning problem. Observe that the state $T$ is itself a
random variable in this lemma; the expectation is over all values of $T$.

Lemma 4.5 Let $\pi_1$ be any arbitrary finite sequence of explorations starting from system-state $S$. Let $T$ be
the system-state at the end of $\pi_1$. Let $\pi_2$ be an optimum $h$ step strategy for the budgeted learning problem
starting from the system-state $T$. Then $E[B_{\pi_2}(h, T)] \ge B^*(h, S)$.

Using the above lemma, we can show the following: 

Lemma 4.6 For any positive integer $h \ge 1$, the expected reward of the discount oblivious strategy RatioScale
in the first $h$ steps is $\Omega(F^*(h, S_0))$.

Lemma 4.7 For any positive integer $h \ge 1$, the expected reward of the discount oblivious strategy GittinsScale
in the first $h$ steps is $\Omega(F^*(h, S_0))$.



Theorem 4.2 $F_{\text{RatioSwitch}(h)}(h) = \Omega(F^*(h))$.

Combining lemma 3.3 with the proof of theorem 4.2 we obtain:

Theorem 4.3 $F_{\text{GittinsSwitch}(h)}(h) = \Omega(F^*(h))$.

Invoking lemma 4.4 now gives us:



Theorem 4.8 Strategies RatioScale and GittinsScale both give a constant factor approximation to 
the multi-armed bandit problem simultaneously for all discount factor sequences A. 

5 Computing the Ratio Index 

We will now sketch how the ratio index can be computed. In the process, we will also get several useful
insights into its structural properties. Given a single bandit-arm $i$, an initial state $\rho$ for $i$, an exploration
budget of $h$, and a state space $T_i$ truncated to depth $h$, we view $T_i$ as a layered DAG of depth $h$, which is to
say that for any arm-state $u$ in layer $j$, if $P_{uv} > 0$, then $v$ must be in layer $j + 1$. As explained in section 5.1,
this is without loss of generality. We let $\Sigma$ be the number of nodes in the layered DAG. Additionally, for any
state $u$ in $T_i$, we use $T_i^u$ to denote the sub-DAG of $T_i$ with root $u$; thus $T_i^\rho = T_i$.

For the purposes of this section, we require the use of randomized single arm policies. Whereas a
deterministic single arm policy (corresponding to arm-state $v$) will always either explore $v$, exploit $v$, or
abandon with probability 1, a randomized policy $\pi$ selects $e_v, p_v$ with $e_v, p_v \ge 0$ and $e_v + p_v \le 1$, where $e_v$
represents the probability $\pi$ explores in this state, $p_v$ represents the probability $\pi$ exploits in this state,
and $1 - e_v - p_v$ represents the probability $\pi$ abandons in this state. The vectors $x^\pi$ and $z^\pi$ are defined
for randomized policies as for deterministic policies, as are the profit $\mathcal{P}(\pi)$ and cost $\mathcal{C}(\pi)$ of the policy.
Our approach below will calculate the ratio index $r(u, h)$ for all $u \in T_i$ as well as the entire profit curve
$\mathcal{P}_u(\cdot)$ for all $u$, where $\mathcal{P}_u(C_u) = \max_\pi \mathcal{P}(\pi)$ and the max is over all randomized single arm policies
$\pi$ with initial state $u$ and $\mathcal{C}(\pi) \le C_u$. We show that there exists a deterministic policy that induces the
maximum $\mathcal{P}_u(C_u)/C_u$ over all $C_u > 0$ (and in fact our algorithm will find such a policy). Thus, the value
$\max_{C_u > 0} \mathcal{P}_u(C_u)/C_u$ is the ratio index for $u$ given $h$, i.e., $r(u, h)$. Our algorithm relies heavily upon the following
theorem on the structure of the profit curve.

Theorem 5.1 The profit curve $\mathcal{P}_u(\cdot)$ for any given state $u$ is concave and piecewise linear with at most
$2\Sigma_u$ segments, where $\Sigma_u$ represents the number of states in $T_i^u$.

The proof of this theorem involves several steps and is deferred to appendix D. Towards proving the
theorem, we show that as the budget increases along the profit curve for $u$, a monotonicity property holds
that for every state $v \in T_i^u$, both $p_v$ and $e_v + p_v$ are non-decreasing.

Lemma 5.2 For any $C^{(1)}, C^{(2)}$ with $C^{(2)} > C^{(1)}$ there exist optimal solutions $(e^{(1)}, p^{(1)})$ and $(e^{(2)}, p^{(2)})$
to $LP_u(C^{(1)})$ and $LP_u(C^{(2)})$ respectively such that $p_v^{(1)} \le p_v^{(2)}$ and $e_v^{(1)} + p_v^{(1)} \le e_v^{(2)} + p_v^{(2)}$ for all $v$ in $T_i^u$.

We further characterize the intersection of line segments of the profit curve as "corner" solutions and
show that at these points $p_v \in \{0, 1\}$ and $e_v \in \{0, 1\}$ for all states in $T_i^u$. Thus, these points of the curve
are induced by deterministic policies. Thus, the policy which induces the "corner" solution at the end of the
first segment of the profit curve is a deterministic ratio index policy.
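To illustrate why the first segment determines the ratio index, the sketch below builds the upper concave envelope of a handful of (cost, profit) points and reads off the slope of its first segment. It is not the paper's algorithm, which works recursively on piecewise linear curves; the point values are invented, and the only assumption taken from the text is that abandoning corresponds to the point (0, 0) and exploiting to a point at cost 1.

```python
from fractions import Fraction as F


def upper_concave_envelope(points):
    """Return the vertices of the upper concave envelope of (cost, profit) points."""
    hull = []
    for x, y in sorted(set(points)):
        if hull and hull[-1][0] == x:
            hull.pop()  # same cost: sorted order guarantees the new profit is higher
        # Drop vertices that lie on or below the chord to the new point.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1) >= 0:
                hull.pop()
            else:
                break
        hull.append((x, y))
    return hull


def first_segment_slope(hull):
    """Slope of the first segment; for a curve through (0, 0) this is max profit/cost."""
    (x0, y0), (x1, y1) = hull[0], hull[1]
    return (y1 - y0) / (x1 - x0)


# Made-up breakpoints: abandon, exploit, and three points of an exploration curve.
abandon = (F(0), F(0))
exploit = (F(1), F(11, 20))
exploration_curve = [(F(1, 5), F(1, 10)), (F(1, 2), F(2, 5)), (F(4, 5), F(1, 2))]

hull = upper_concave_envelope([abandon, exploit] + exploration_curve)
ratio = first_segment_slope(hull)
```

In this toy instance the point at cost 1/5 falls below the envelope, and the corner at the end of the first segment attains the largest profit-to-cost ratio among all the points.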

5.1 Algorithm for Computing the Profit Curve 

The algorithm for computing the profit curve (and hence the ratio index) involves recursively calculating the
profit curve for a state $u$ given the profit curves for all of its successor states. We begin by constructing an
exploration profit curve for $u$, $X_u(\cdot)$, which denotes the optimal profit for any given cost conditioned on the
fact that we are exploring at $u$ (i.e. $e_u = 1$). We then take the concave envelope over this curve combined
with the abandonment policy and the exploitation policy. Figure 1 in appendix D shows a typical example
of the relationship between these two curves.

Superficially, it might seem that the number of segments of the profit curves could increase exponentially
as we perform this process up the DAG. However, Theorem 5.1 guarantees that the number of segments
remains bounded and the entire curve for $u$ can be computed in time $O(d \Sigma_u \log \Sigma_u)$ given the successor
curves, where $d$ represents the maximum number of immediate descendants for any node. Thus, this algorithm
is strongly polynomial (in $\Sigma$) for computing the entire profit curve of a state in the layered DAG, and
hence, the ratio index. If the underlying state space of the bandit-arm is an unlayered DAG, we can make it
layered by multiplying the number of states by at most $\Sigma$, so the algorithm is still strongly polynomial in $\Sigma$.
If the underlying state space is not a DAG, we can convert it into a layered DAG by multiplying the number
of states by at most $h$. Details of the algorithm and the analysis are in appendix D.

A The Budgeted Learning Problem

We will now provide the missing proofs from section 2.

Proof of Theorem 2.1: Consider three arms $A, B, C$ with $(\alpha, \beta)$ priors: $\alpha_A = 5$, $\beta_A = 4$, $\alpha_B = 28$,
$\beta_B = 19$, $\alpha_C = 28$, $\beta_C = 19$. Assume the horizon $h$ is just 1.

First consider the scenario where arms A and B are the only ones present. Observe that 28/48 > 
5/9. So if we play arm B once and the trial is a failure, arm B will still be more profitable than arm 
A. Hence, playing arm B gives an expected profit of 28/47 = 0.5957 since arm B will be chosen as 
the winner regardless of the outcome. Also, 28/47 < 6/10 so if we play arm A once and the trial is a 
success, arm A becomes more profitable than arm B. Hence, exploring A first gives an expected profit of 
(5/9) x (6/10) + (4/9) x (28/47) = 0.5981. Therefore, if there exists an index for the budgeted learning 
problem, the index of arm A must be higher than that of arm B. 

Now consider the scenario where arms A, B,C are all present. Playing arm A first gives the same 
expected profit as before: 0.5981. If we play arm B, and the trial is a failure, arm C will be chosen as the 
winner, giving an expected profit of (28/47) x (29/48) + (19/47) x (28/47) = 0.6008. Hence, arm B 
must have a higher index than arm A, which is a contradiction. ■ 
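The arithmetic in this counterexample can be re-checked with exact fractions. The sketch below assumes the usual Beta prior update (a success increments $\alpha$, a failure increments $\beta$, and the expected payoff is $\alpha/(\alpha+\beta)$); the helper names are illustrative.

```python
from fractions import Fraction as F


def mean(alpha, beta):
    """Expected payoff of an arm with an (alpha, beta) prior."""
    return F(alpha, alpha + beta)


def explore_once(arm, others):
    """Expected profit of playing `arm` once, then picking the best posterior mean."""
    a, b = arm
    best_other = max(mean(*o) for o in others)
    p_success = mean(a, b)
    after_success = max(mean(a + 1, b), best_other)
    after_failure = max(mean(a, b + 1), best_other)
    return p_success * after_success + (1 - p_success) * after_failure


A, B, C = (5, 4), (28, 19), (28, 19)
two_arms_A = explore_once(A, [B])       # about 0.5981
two_arms_B = explore_once(B, [A])       # exactly 28/47, about 0.5957
three_arms_A = explore_once(A, [B, C])  # unchanged, about 0.5981
three_arms_B = explore_once(B, [A, C])  # about 0.6008
```

With two arms, exploring A is better; with the third arm present, exploring B is better, which is the reversal the proof relies on.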

Proof of Lemma 2.5: Without loss of generality, assume that the state space $T_i$ is a tree. Define $\hat{\pi}$ as
follows: $\hat{\pi}$ makes the same choices as $\pi$ for all arm-states which are either explored or exploited by $\pi$. This
ensures that condition 1 in the lemma is satisfied. Further, the total cost of $\hat{\pi}$ on these states is merely the
total cost of $\pi$.

For an arm-state $u$ that is abandoned by $\pi$, $\hat{\pi}$ does the following:

(a) If any ancestor of $u$ in $T_i$ gets abandoned by $\pi'$, then $\hat{\pi}$ abandons $u$.

(b) If any ancestor of $u$ gets exploited by $\pi'$, then $\hat{\pi}$ exploits $u$.

(c) Else, $\hat{\pi}$ makes the same choice as $\pi'$ on $u$ and all the descendants of $u$.

In order to prove condition 2 in the lemma, we have to bound (by charging to $\pi'$) the cost incurred (by
$\hat{\pi}$) in exploring/exploiting those arm-states $u$ (and their descendants) that are abandoned by $\pi$. In case (a)
above, $u$ is abandoned and no extra cost is incurred. In case (b), let $v$ be the ancestor arm-state of $u$ that
was exploited by $\pi'$. The cost $x_u^{\hat{\pi}}$ of exploiting $u$ is the cost $x_v^{\pi'}$ of exploiting $v$ times the probability of
$\pi$ reaching $u$ conditioned on $\pi$ reaching $v$. Since the total probability that $\pi$ abandons a descendant of $v$
conditioned on $\pi$ reaching $v$ is at most 1, the incremental cost for $\hat{\pi}$ for all descendants of $v$ can be no more
than $x_v^{\pi'}$ and thus there can be no overcharging. In case (c), the charging is quite straightforward since $\hat{\pi}$
just mimics $\pi'$.

In order to prove condition 3 in the lemma, consider any arm-state $u$ that is exploited by $\pi'$. If that state
is exploited by $\pi$, then it is also exploited by $\hat{\pi}$. If that state is explored by $\pi$, then eventually, $\pi$ must either
abandon or exploit a descendant along every path in the state space starting from $u$. Descendants that get
abandoned by $\pi$ will get exploited by $\hat{\pi}$ by property (b). By the martingale property, the total profit obtained
by $\hat{\pi}$ from all the descendant states of $u$ is the same as the profit obtained by $\pi'$ from $u$. If the arm-state $u$ is
abandoned or not reached by $\pi$, then $\hat{\pi}$ mimics $\pi'$ according to property (c). Hence, $\hat{\pi}$ gets at least as much
profit as $\pi'$. ■



Proof of Corollary 2.7: Let $\pi^*$ denote the optimum policy. Define a new policy that we call the
restriction of $\pi^*$ to arm $i$, denoted $L_i$, as follows: $L_i$ follows $\pi^*$ when it explores/exploits arm $i$, and
simulates $\pi^*$ (without really playing it) on other states. A simple coupling argument shows that the total
expected cost of all these single arm policies is equal to the expected cost of $\pi^*$, and the total expected profit
of all these single arm policies is equal to the expected profit of $\pi^*$. Similarly, let $M$ denote the greedy
algorithm over the first $j - 1$ stages, and let $N_i$ denote the restriction of $M$ to arm $i$.

The following is now immediate: $\sum_i \mathcal{P}(L_i) - \sum_i \mathcal{P}(N_i) = \mathcal{P}(\pi^*) - \mathcal{P}_{j-1}$ and $\sum_i \mathcal{C}(L_i) = \mathcal{C}(\pi^*) \le 2$. Hence,
there exists some $i$ such that $(\mathcal{P}(L_i) - \mathcal{P}(N_i))/\mathcal{C}(L_i) \ge (\mathcal{P}(\pi^*) - \mathcal{P}_{j-1})/2$. Applying Lemma 2.5 with $N_i$ serving
the role of $\pi$ and $L_i$ serving the role of $\pi'$ proves the corollary. ■

Proof of Corollary 2.8: If we take the integral in the proof of the above theorem with $c$ going from 0
to 1/2 instead of from 0 to 1, we get an approximation factor of $(1 - e^{-3/16}) \approx 0.17$. Hence, the optimum
reward from the budgeted learning problem with budget $h/2$ is at least 0.17 times the optimum reward with
budget $h$. ■

B Relating the Gittins and ratio indices

We will now prove the lemmas from section 3.

Lemma B.1 (same as Lemma 3.1) For any $h \ge 2$,

$$\rho(\theta) \ge r(h) \left(1 - \frac{1}{h}\right)^h$$

where $\theta = (1 - \frac{1}{h})$. Thus as $h \to \infty$, $\rho(\theta) \ge r(h)/e$.
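For a feel of the factor $(1 - 1/h)^h$ appearing in this lemma, the following sketch evaluates it for a few horizons. The function name and the chosen horizons are illustrative only.

```python
import math


def index_ratio_factor(h):
    """Return (1 - 1/h)**h, the factor relating rho(theta) and r(h) in Lemma B.1."""
    return (1.0 - 1.0 / h) ** h


factors = {h: index_ratio_factor(h) for h in (2, 10, 100, 10_000)}
limit = 1.0 / math.e
```

The factor equals 1/4 at $h = 2$ and increases towards $1/e$ as $h$ grows, which is where the limiting statement $\rho(\theta) \ge r(h)/e$ comes from.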

Proof: Consider a strategy $S_1$ that computes the ratio index $r(h)$ for arm $A$ with horizon limited to $h$.
Let $r = r(h)$. We will modify the strategy $S_1$ into another strategy $S_2$ that will certify that the Gittins index
for arm $A$ is at least $r/c$ for some suitable constant $c > 1$. The decision tree for strategy $S_2$ is obtained by
modifying the decision tree for $S_1$ as follows:

1. When the decision tree for strategy $S_1$ visits a node labeled "Abandon", the tree for $S_2$ changes it to
playing the standard bandit-arm $S_{r/c}$ and terminates this branch of the tree.

2. When the decision tree for strategy $S_1$ visits a node labeled "Exploit", the tree for $S_2$ replaces it with
playing the arm $A$ forever.

Let $\gamma$ denote the expected reward of the bandit-arm $A$ conditioned on exploitation by the strategy $S_1$.
Also, let $b_R$ and $b_T$ denote the exploration and the exploitation budget, respectively, of the strategy $S_1$. Thus,
by definition, the ratio index is given by

$$r = \frac{\gamma b_T}{b_R + b_T}.$$

We can now lower bound the difference $\Delta$ in the expected reward generated by the strategy $S_2$ over the
standard arm $S_{r/c}$ to be at least

$$\Delta \ge b_T \left(\gamma - \frac{r}{c}\right) \theta^h \sum_{i \ge 0} \theta^i - b_R h \left(\frac{r}{c}\right).$$

Substituting $\gamma = \frac{r(b_R + b_T)}{b_T}$ and $\sum_{i \ge 0} \theta^i = h$, we get

$$\Delta \ge \left((b_R + b_T) r - b_T \left(\frac{r}{c}\right)\right) h \theta^h - b_R h \left(\frac{r}{c}\right) \ge (b_R h r) \left(\theta^h - \frac{1}{c}\right).$$

Thus $\Delta \ge 0$ if $c \ge (1 - \frac{1}{h})^{-h}$, establishing that $\rho \ge r (1 - \frac{1}{h})^h$.

Lemma B.2 (same as Lemma 3.2) For any $h \ge 2$,

$$\rho(\theta) \le 18 \, r(h)$$

where $\theta = (1 - \frac{1}{h})$.

Proof: Let $r = r(h)$ and $\rho = \rho(\theta)$. The optimum discounted reward strategy given a bandit-arm $A$ and
the standard bandit-arm $S_\rho$ is indifferent at time 0 between playing the two bandit-arms. Let $S_1$ be a Gittins
index strategy for arm $A$ and the standard bandit-arm $S_\rho$ such that $S_1$ plays $A$ at time 0. Each node in the
decision tree of $S_1$ is labeled as either playing the standard bandit $S_\rho$ or playing the arm $A$.

We will modify the strategy $S_1$ into another strategy $S_2$ that will certify that the ratio index for arm $A$
is at least $\rho/c$ when horizon is restricted to $h$; here $c > 1$ is a suitable constant to be specified later. Note
that unlike strategy $S_1$, the strategy $S_2$ does not have access to the standard bandit $S_\rho$. However, since the
standard bandit $S_\rho$ is in a stationary state, when we say a state $s$ in $S_1$, we will assume that it refers only to
the state of the arm $A$. The decision tree for strategy $S_2$ is obtained by modifying the decision tree for $S_1$ as
follows:



1. When the decision tree for strategy $S_1$ visits a state that is labeled as playing $S_\rho$ and the depth is at
most $h$, the strategy $S_2$ abandons.

2. When the decision tree for strategy $S_1$ visits a state $s$ labeled as playing the arm $A$, then

(a) If the depth is less than $h$, then the strategy $S_2$ explores if the expected reward of playing the
arm $A$ in state $s$ is less than $2\rho/c$, and it exploits otherwise.

(b) If the depth is exactly $h$, strategy $S_2$ exploits.

Clearly, the strategy $S_2$ as defined above has its horizon restricted to $h$.

We will use label($s$) and depth($s$) to denote respectively the label and depth of a state $s$ in the decision
tree of strategy $S_2$. A state $s$ is assigned a label of "R" if $S_2$ explores in this state, and a label of "T" if it
exploits in this state. Finally, we will denote the probability of reaching a state $s$ in $S_1$ by $p_1(s)$, and the
reward at a state $s$ in the strategies $S_1$ and $S_2$ by $\pi_1(s)$ and $\pi_2(s)$ respectively. Note that $\pi_1(s) = \pi_2(s)$ for
any state $s$ where strategy $S_1$ plays the arm $A$.

The expected reward $\gamma$ collected by $S_2$, as well as the exploration and the exploitation budget of $S_2$,
denoted by $b_R$ and $b_T$ respectively, are given by

$$\gamma = \sum_{\{s \in S_2 \mid \text{label}(s) = T\}} p_1(s) \pi_2(s), \qquad b_R = \frac{1}{h} \sum_{\{s \in S_2 \mid \text{label}(s) = R\}} p_1(s), \qquad b_T = \sum_{\{s \in S_2 \mid \text{label}(s) = T\}} p_1(s).$$

We now lower bound the total reward deficit that is accumulated by the strategy $S_1$ in the states that are
labeled as "exploring" in the strategy $S_2$, as compared to the reward of playing the standard bandit $S_\rho$.

$$\sum_{\{s \in S_2 \mid \text{label}(s) = R\}} p_1(s) (\rho - \pi_1(s)) \theta^{\text{depth}(s)} = \sum_{\{s \in S_2 \mid \text{label}(s) = R\}} p_1(s) (\rho - \pi_2(s)) \theta^{\text{depth}(s)} \ge \frac{h b_R}{4} \left(\rho - \frac{2\rho}{c}\right)$$

since $\theta^{\text{depth}(s)} \ge 1/4$ for depth($s$) $\le h$.

On the other hand, the total reward surplus accumulated by the strategy $S_1$ above the reward of playing
the standard bandit $S_\rho$ in the subtrees of the states labeled as "exploiting" in $S_2$, can be upper bounded by

$$\sum_{\{s \in S_2 \mid \text{label}(s) = T\}} p_1(s) (\pi_1(s) - \rho) \left(\sum_{i \ge 0} \theta^i\right) \le h \left(\sum_{\{s \in S_2 \mid \text{label}(s) = T\}} p_1(s) \pi_2(s)\right).$$

The bound above follows from the martingale property which implies that for any state $s$, we must satisfy
the relationship $\pi_1(s) \le \pi_2(s) + \rho$. To see this, observe that for any state $s$, by the martingale property we have
$\pi_2(s) = \sum_{v \in N(s)} P_{sv} \pi_2(v)$, where $N(s)$ denotes the set of descendants of a state $s$, and $P_{sv}$ denotes the
probability of transitioning from state $s$ to state $v$ when arm $A$ is played in state $s$. On the other hand,

$$\pi_1(s) = \sum_{v \in N(s)} P_{sv} \max\{\pi_2(v), \rho\} = \sum_{\{v \in N(s) \mid \pi_2(v) > \rho\}} P_{sv} \cdot \pi_2(v) + \sum_{\{v \in N(s) \mid \pi_2(v) \le \rho\}} P_{sv} \cdot \rho \le \pi_2(s) + \rho.$$

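Since the total reward surplus in $S_1$ w.r.t. the standard bandit $S_\rho$ must be at least as large as the total
reward deficit w.r.t. $S_\rho$, it follows that

$$h \left(\sum_{\{s \in S_2 \mid \text{label}(s) = T\}} p_1(s) \pi_2(s)\right) \ge \frac{h b_R}{4} \left(\rho - \frac{2\rho}{c}\right).$$

Thus

$$b_R \le \frac{4 \gamma c}{(c - 2) \rho}. \qquad (1)$$

Let $b_{T_1} = \sum_{\{s \in S_2 \mid \text{label}(s) = T \text{ and } \pi_2(s) \ge 2\rho/c\}} p_1(s)$, and $b_{T_2} = b_T - b_{T_1}$. Clearly,

$$\gamma \ge \left(\frac{2\rho}{c}\right) b_{T_1}. \qquad (2)$$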

Finally, we observe that the Gittins index policy $S_1$ never plays the bandit-arm $A$ in a state $s$ if it ever
plays the standard bandit in any ancestor of state $s$. Therefore, for any $i < h$,

$$b_{T_2} \le \sum_{\{s \in S_2 \mid \text{label}(s) = R \text{ and depth}(s) = i\}} p_1(s).$$

Averaging over all $i < h$, we get

$$b_{T_2} \le \frac{1}{h} \left(\sum_{\{s \in S_2 \mid \text{label}(s) = R\}} p_1(s)\right) = b_R. \qquad (3)$$



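We can now lower bound the ratio index $r$ by the expected reward to exploration and exploitation budget
ratio achieved by the policy $S_2$:

$$\begin{aligned}
r &\ge \frac{\gamma}{b_R + b_{T_1} + b_{T_2}} \\
  &\ge \frac{\gamma}{2 b_R + b_{T_1}} && \text{(using (3))} \\
  &\ge \frac{\gamma}{\frac{8 \gamma c}{(c - 2) \rho} + \frac{\gamma c}{2 \rho}} && \text{(using (1) and (2))} \\
  &= \frac{\rho}{\frac{8c}{c - 2} + \frac{c}{2}} \\
  &\ge \frac{\rho}{18} && \text{(when } c = 18 \text{)}
\end{aligned}$$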


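The constant 18 in Lemma B.2 comes from evaluating the denominator $\frac{8c}{c-2} + \frac{c}{2}$ of the last bound. The sketch below does this with exact fractions; the function name is illustrative, and $c$ is the free constant used in the construction of $S_2$.

```python
from fractions import Fraction


def bound_denominator(c):
    """Return 8c/(c-2) + c/2, so that the chain above gives r >= rho / bound_denominator(c)."""
    c = Fraction(c)
    if c <= 2:
        raise ValueError("the bound needs c > 2")
    return 8 * c / (c - 2) + c / 2


at_18 = bound_denominator(18)  # 8*18/16 + 9
```

At $c = 18$ both terms equal 9, which gives $r \ge \rho/18$ and hence the statement of the lemma.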

Lemma B.3 (same as Lemma 3.3) Let $\rho_i(t)$ denote the Gittins index of arm $i$ at time $t$, where the discount
factor $\theta$ is $1 - 1/h$. Playing the arm with the highest value of $\rho_i(t)$ for $t = 1, 2, \ldots, h$ and then picking
the arm with the highest expected payoff at time $t$ results in a constant factor approximation to the budgeted
learning problem.



Proof: Consider the strategy $S_2$ described in the proof of Lemma 3.2 above. By the proof of Lemma 3.2
the profit to cost ratio of this strategy is $\Omega(\rho(\theta))$, which by Lemma 3.1 is $\Omega(r(h))$. We will call $S_2$ the
truncated Gittins policy. Using the truncated Gittins policy instead of the ratio index policy in the proof of
theorem 2.6 we can conclude that repeatedly choosing the arm with the highest Gittins index and playing the
truncated Gittins policy for this arm (the entire policy, not just for a single step) till it abandons or exploits
gives a constant factor approximation to the budgeted learning problem. We will call this the TG algorithm.

Now consider the strategy which repeatedly plays the arm with the highest Gittins index for one step 
and then at time h, picks the arm with the highest payoff; we will call this the Gittins strategy. The Gittins 
strategy is identical to the TG algorithm as long as the TG algorithm continues exploring some arm. The TG 
algorithm may choose to exploit before the entire budget is exhausted, in which case making an arbitrary 
set of additional exploration steps can not hurt (by the martingale property). The TG algorithm may also 
choose to exploit an arm with a smaller current expected payoff than some other arm; again, choosing the 
arm with the highest expected current payoff can not hurt. In either case, the expected profit of the Gittins 
strategy is no worse than that of the TG algorithm, and hence, the Gittins strategy is also a constant factor 
approximation to the budgeted learning problem. ■ 

C Finite Horizon and Discount Oblivious Multi-armed Bandits 

We will now provide the proofs from section 4.

Proof of Lemma 4.1: Consider the fixed horizon strategy (for horizon $h$) that first solves the budgeted
learning problem with budget $\lfloor h/2 \rfloor$ and then exploits the winner from this budgeted learning problem for
the remaining $\lceil h/2 \rceil$ time steps. This strategy has expected payoff at least $\lceil h/2 \rceil \cdot B^*(\lfloor h/2 \rfloor)$ and hence,
$F^*(h) \ge \lceil h/2 \rceil \cdot B^*(\lfloor h/2 \rfloor)$.

Now consider the budgeted learning strategy (with budget $h$) that emulates the optimum fixed horizon
strategy (for horizon $h$) but only for $t$ steps, where $t$ is an integer chosen uniformly at random from the set
$\{0, 1, \ldots, h - 1\}$, and then declares the arm that the optimum fixed horizon strategy was about to play in
step $t + 1$ as the winner. This budgeted learning strategy has expected payoff exactly $F^*(h)/h$ and hence,
$F^*(h) \le h \cdot B^*(h)$. ■

Proof of Theorem 4.2: Assume, for simplicity, that $h$ is even; the odd case is very similar. Now,

$$\begin{aligned}
F_{\text{RatioSwitch}(h)}(h) &\ge (h/2) B_G(h/2) \\
  &= \Omega(h B^*(h)) && \text{[from the results of section 2]} \\
  &= \Omega(F^*(h)) && \text{[from lemma 4.1]}
\end{aligned}$$



Proof of Lemma 4.4: The "only if" part is trivial since a fixed horizon problem can be modeled using
a discount factor sequence, as explained earlier. For the "if" part, consider a strategy $\pi$ that offers expected
reward $r(t)$ at time $t$. Then the expected discounted reward $D_\pi(\Lambda)$ of $\pi$ given a discount factor sequence $\Lambda$ can be
written as $\sum_{h=0}^{\infty} (\lambda_h - \lambda_{h+1}) \left(\sum_{t=0}^{h} r(t)\right)$. Observe that $\lambda_h - \lambda_{h+1} \ge 0$ by the definition of a discount
factor sequence. If we simultaneously $\kappa$-approximate each of the finite horizon rewards $\sum_{t=0}^{h} r(t)$, we
also simultaneously $\kappa$-approximate any non-negative linear combination, and hence the optimum expected
discounted reward for any discount factor sequence. ■
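The rewriting used in this proof is a summation by parts, and it can be checked on a small example. The sketch below assumes a discount factor sequence that is zero beyond the listed entries; the reward and discount values are invented.

```python
from fractions import Fraction as F


def discounted(rewards, lambdas):
    """Return sum_t lambda_t * r(t)."""
    return sum(lam * r for lam, r in zip(lambdas, rewards))


def mixture_of_horizons(rewards, lambdas):
    """Return sum_h (lambda_h - lambda_{h+1}) * sum_{t <= h} r(t).

    The sequence is treated as 0 past the end of `lambdas` (an assumption of
    this sketch), so the final weight is simply the last listed lambda.
    """
    padded = list(lambdas) + [F(0)]
    total, prefix = F(0), F(0)
    for h, r in enumerate(rewards):
        prefix += r  # finite horizon reward through step h
        total += (padded[h] - padded[h + 1]) * prefix
    return total


rewards = [F(3), F(1), F(4), F(1)]
lambdas = [F(1), F(3, 4), F(1, 2), F(1, 4)]
direct = discounted(rewards, lambdas)
mixed = mixture_of_horizons(rewards, lambdas)
```

Every weight `padded[h] - padded[h + 1]` is non-negative for a non-increasing sequence, which is what lets a guarantee for each horizon carry over to the discounted total.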



Proof of Lemma 4.6: If $h = 1$, RatioSwitch($1, S_0$) simply plays the arm with the highest expected
profit, and hence obtains profit $F^*(1)$. If $h = 2$, it is easy to see that $F^*(2) \le 3F^*(1)$ and hence the lemma
holds. We will now assume that $h \ge 3$.

Let $t$ be the largest power of 2 which is no larger than $(h + 1)/2$; hence $t > h/4$. Since $h \ge 3$, we have
$t \ge 2$. The strategy RatioScale is guaranteed to execute the strategy RatioSwitch($t, S_{t-1}$) sometime
during the first $h$ steps (in fact during steps $t, \ldots, 2t - 1$). Using lemma 4.5 we can repeat the steps in the
proof of theorem 4.2 to claim that $E[F_{\text{RatioSwitch}(t, S_{t-1})}(t, S_{t-1})] = \Omega(F^*(t, S_0)) = \Omega(F^*(h, S_0))$.
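The three facts about $t$ used above ($t \ge 2$, $t > h/4$, and the phase ending by step $h$) can be checked mechanically for a range of horizons. The helper name below is illustrative.

```python
def embedded_phase(h):
    """Largest power of two t with t <= (h + 1) / 2, for a horizon h >= 3."""
    if h < 3:
        raise ValueError("the argument assumes h >= 3")
    t = 1
    while 4 * t <= h + 1:  # doubling keeps 2 * t <= (h + 1) / 2
        t *= 2
    return t


checks = [(h, embedded_phase(h)) for h in range(3, 2000)]
```

For each horizon in the range, the phase of length $t$ occupies steps $t, \ldots, 2t - 1$ and so finishes within the first $h$ steps while covering more than a quarter of them.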



The proof of Lemma 4.7 is similar to the above proof.

D Calculating the profit curve

We will now present the full details of the proof of theorem 5.1 as well as the full algorithm to compute the
profit curves and ratio indices for all states up to depth $h$ for a given bandit-arm.

D.1 Proof of Theorem 5.1

Below, we prove a series of claims that together imply theorem 5.1. We begin by considering two methods
of calculating $\mathcal{P}_u(C_u)$ that will be used in our discussions. The first is a recursive equation that can be used
to calculate $\mathcal{P}_u(C_u)$ for a given state $u$ and budget $C_u$ provided that we have the entire profit curves of all
successor states. This equation is

$$\mathcal{P}_u(C_u) = \max_{p_u, e_u, E_u} \; p_u \zeta(u) + e_u \sum_{v \in D(u)} P_{uv} \mathcal{P}_v(E_u^v)$$

where the constraints are:

$$\begin{aligned}
p_u + e_u &\le 1 \\
p_u + e_u \left(\frac{1}{h} + \sum_{v \in D(u)} P_{uv} E_u^v\right) &\le C_u \\
p_u, e_u, E_u &\ge 0
\end{aligned}$$

$D(u)$ is the set of immediate descendants of $u$. The decision variables $p_u$ and $e_u$ represent the probability
of exploiting and exploring in $u$ respectively (as in the definition of a randomized policy). The vector $E_u$
represents the budgets we would allocate to each of the immediate descendants of $u$ should we visit them.
Recall that $P_{uv}$ is the probability of transitioning to state $v$ given we are experimenting in state $u$. We assume
$E_u^v = 0$ if $v \notin D(u)$.

Alternatively, the following LP ($LP_u(C_u)$) (similar to the one in [18]) for a given state $u$ of bandit-arm
$i$ reveals a policy $(w, x, z)$ which induces $\mathcal{P}_u(C_u)$ for a given $C_u$.

$$\begin{aligned}
\max \quad & \sum_{v \in T_i^u} x_v \zeta(v) \\
\text{s.t.} \quad & \sum_{v \in T_i^u} \left(x_v + \frac{z_v}{h}\right) \le C_u \\
& \sum_{y : v \in D(y)} z_y P_{yv} = w_v && \forall v \in T_i^u \setminus \{u\} \\
& w_u = 1 \\
& x_v + z_v \le w_v && \forall v \in T_i^u \\
& x_v, z_v \ge 0 && \forall v \in T_i^u
\end{aligned}$$

For any state $v \in T_i^u$, $w_v$ represents the probability the bandit-arm enters $v$, $x_v$ represents the unconditional
probability of exploiting in $v$, and $z_v$ represents the unconditional probability of experimenting in $v$.
Given the stochastic nature of the $P$ matrix and the fact that all of the $z$ values are less than or equal to their
corresponding $w$ values, we can see that each element of $w$ will be bounded between 0 and 1. Thus, $x$ and
$z$ will also be automatically bounded above by 1. Thus, we do not need further constraints bounding these
variables.

Using both the recursive equation and the LP, we can show the following. 

Claim D.1 $\mathcal{P}_u(\cdot)$ is a concave, nondecreasing function and if $C_u \ge 1$, $\mathcal{P}_u(C_u) = \mathcal{P}_u(1) = \zeta(u)$.

Proof: Given any $C_u$ and associated policy $\{p_u, e_u, E_u\}$ for the recursive equation that realizes $\mathcal{P}_u(C_u)$,
observe first that this policy is also feasible for any cost greater than or equal to $C_u$. Thus, $\mathcal{P}_u(C_u)$ is a
nondecreasing function.

We prove the remainder of the claim by induction. Looking at $\mathcal{P}_u(C_u)$ for a state $u$ that is at depth $h$,
it is easy to see that $\mathcal{P}_u(C_u) = C_u \cdot \zeta(u)$ if $C_u \le 1$ and $\zeta(u)$ if $C_u \ge 1$. Thus, at depth $h$, if $C_u \ge 1$,
$\mathcal{P}_u(C_u) = \mathcal{P}_u(1) = \zeta(u)$ and $\mathcal{P}_u(\cdot)$ is concave.

Now assume these properties hold for all states at depth $i + 1$ and look at a state $u$ at depth $i$. Our
profit is obviously non-decreasing in each of the decision variables $\{p_u, e_u, E_u\}$. Set $E_u^v = 1$ $\forall v \in D(u)$.
From the induction hypothesis, $\sum_{v \in D(u)} P_{uv} \mathcal{P}_v(1) = \sum_{v \in D(u)} P_{uv} \zeta(v)$ and by the martingale property,
$\sum_{v \in D(u)} P_{uv} \zeta(v) = \zeta(u)$, so our objective becomes $\max p_u \zeta(u) + e_u \zeta(u)$ or equivalently $\max (p_u + e_u) \zeta(u)$.
Since $(p_u + e_u) \le 1$, this can be no larger than $\zeta(u)$. But clearly $\zeta(u)$ can be achieved with a cost
of 1 by setting $p_u = 1$ and $e_u = 0$. Hence, $\mathcal{P}_u(C_u) = \zeta(u)$ $\forall C_u \ge 1$.

With respect to concavity, for any two points on the profit curve of $u$, $\mathcal{P}_u(C^{(1)})$ and $\mathcal{P}_u(C^{(2)})$, corresponding
to $C^{(1)} < C^{(2)}$, let $(w^{(1)}, x^{(1)}, z^{(1)})$ and $(w^{(2)}, x^{(2)}, z^{(2)})$ be their associated policies respectively
as defined on $LP_u(C_u)$. Consider the policy $\left(\frac{w^{(1)} + w^{(2)}}{2}, \frac{x^{(1)} + x^{(2)}}{2}, \frac{z^{(1)} + z^{(2)}}{2}\right)$. This policy is feasible for the
problem of finding the profit associated with budget $\frac{C^{(1)} + C^{(2)}}{2}$, so

$$\mathcal{P}_u\left(\frac{C^{(1)} + C^{(2)}}{2}\right) \ge \sum_{v \in T_i^u} \frac{x_v^{(1)} + x_v^{(2)}}{2} \zeta(v) = \frac{\mathcal{P}_u(C^{(1)}) + \mathcal{P}_u(C^{(2)})}{2}.$$

Thus the profit curve for $u$ is concave.

■

With respect to $LP_u(C_u)$, we define the vectors $e$ and $p$ where $e_v = z_v/w_v$ and $p_v = x_v/w_v$ if $w_v > 0$.
Otherwise, we require only that $e_v, p_v \ge 0$ and $e_v + p_v \le 1$. Thus $e_v$ and $p_v$ are the conditional probability
of exploring and exploiting, respectively, given that we are in state $v$. Thus, the two vectors $(e, p)$ define
a randomized policy that induces a point on $\mathcal{P}_u(\cdot)$. Alternatively, we could define the same policy with the
three vectors $(w, x, z)$. In what follows, we will freely interchange between the two notations. Note that the
one thing we must be careful to observe is that for any policy and state where $w_v = 0$, there are infinitely
many equivalent assignments of $e_v$ and $p_v$.

Appealing to linear programming theory with respect to $LP_u(C_u)$, we can derive several interesting
properties of the profit curve. We begin with the following:

Claim D.2 For any $C_u \in [0, 1]$ there exists an optimal policy with respect to $LP_u(C_u)$ such that $p_v \in \{0, 1\}$
and $e_v \in \{0, 1\}$ for all but at most one state in $T_i^u$.

Proof: Let us consider a basic feasible optimal solution to $LP_u(C_u)$, $(w^*, x^*, z^*)$. We know such a
solution exists since the LP is bounded and a basic feasible solution to the LP exists (for instance the solution
that corresponds to setting $x_v = z_v = 0$ $\forall v$). Let us create $\tilde{T}_i^u$ by removing all states $v$ from $T_i^u$ for which
$x_v^* = z_v^* = 0$. (Note: Since we may be removing some children of states remaining in $\tilde{T}_i^u$, the martingale
property may no longer hold with respect to $\tilde{T}_i^u$.) Let us create $\widetilde{LP}_u(C_u)$ by replacing $T_i^u$ in $LP_u(C_u)$ with
$\tilde{T}_i^u$. $\widetilde{LP}_u(C_u)$ will have the same objective value as $LP_u(C_u)$ and an optimal solution $(\tilde{w}^*, \tilde{x}^*, \tilde{z}^*)$ such that
$\tilde{w}_v^* = w_v^*$ and $\tilde{x}_v^* = x_v^*$ for every state $v \in \tilde{T}_i^u$.

Let us define the number of states in $\tilde{T}_i^u$ as $\tilde{\Sigma}_u$. $\widetilde{LP}_u(C_u)$ has $3\tilde{\Sigma}_u$ variables, $2\tilde{\Sigma}_u$ non-negativity
constraints, and $2\tilde{\Sigma}_u + 1$ other constraints. Thus $\widetilde{LP}_u(C_u)$ has a basic feasible optimal solution, which will have
at least $\tilde{\Sigma}_u - 1$ variables equal to zero. (For discussion of LP theory and the role of basic feasible solutions,
see [20] or a similar resource.)

If exactly $\tilde{\Sigma}_u - 1$ variables are equal to zero, all constraints of the type $x_v + z_v \le w_v$ must be tight. By
virtue of how we created $\tilde{T}_i^u$, we know that each of these zero variables must correspond to a distinct state.
Thus, for these $\tilde{\Sigma}_u - 1$ states either $x_v^* = w_v^* > 0$ or $z_v^* = w_v^* > 0$, and for the only remaining state (call it
$y$), $x_y^* > 0$ and $z_y^* > 0$.

Alternatively there could be $\tilde{\Sigma}_u$ variables equal to zero (but no more since for all states $x_v^* + z_v^* > 0$), in
which case, for at least $\tilde{\Sigma}_u - 1$ states either $x_v^* = w_v^* > 0$ or $z_v^* = w_v^* > 0$.

Looking at this in terms of $p$ and $e$, in at least $\tilde{\Sigma}_u - 1$ states either $p_v^* = 1$ or $e_v^* = 1$. For all states $v$ in
$T_i^u \setminus \tilde{T}_i^u$ we could arbitrarily assign $p_v$ and $e_v$, so this property holds with respect to $LP_u(C_u)$ as well.

■

From the analysis above, we can see that for a given $C_u$, $LP_u(C_u)$ has an optimal basic feasible solution
$(w^*, x^*, z^*)$ that takes one of the following three forms:

1. For every state $v$ where $x^*_v + z^*_v > 0$, $x^*_v + z^*_v = w^*_v$, and in exactly one state $y$,
$x^*_y > 0$ and $z^*_y > 0$. In this case exactly $3\Sigma_u$ constraints of $LP_u(C_u)$ are binding
(including the cost constraint).

2. For every state $v$ where $x^*_v + z^*_v > 0$, either $x^*_v = 0$ or $z^*_v = 0$, and in exactly one state
$y$, $x^*_y + z^*_y < w^*_y$. In this case exactly $3\Sigma_u$ constraints of $LP_u(C_u)$ are binding
(including the cost constraint).

3. For every state $v$ where $x^*_v + z^*_v > 0$, either $x^*_v = 0$ or $z^*_v = 0$, and
$x^*_v + z^*_v = w^*_v$. In this case either $3\Sigma_u + 1$ constraints of $LP_u(C_u)$ are binding, or else
the cost constraint is not binding, which implies the slope of the profit curve at this point is zero.

In the first two types of policies, we will call the state $y$ for which it does not hold that
$p^*_y, e^*_y \in \{0, 1\}$ the transitional state. Note that policy type 3 does not have a transitional state,
and thus corresponds to a deterministic policy.

Further leveraging our understanding of the basic feasible solutions of $LP_u(\cdot)$ and linear programming
theory, we make the following pair of claims.

Claim D.3 Let $(e^*, p^*)$ be a basic feasible optimal policy for $LP_u(C_u)$. If $(e^*, p^*)$ has a transitional
state $y$, then there exists an $\epsilon > 0$ such that there exists an optimal policy $(e^{*(1)}, p^{*(1)})$ for
$LP_u(C_u^{(1)})$, where $C_u^{(1)} \in [C_u - \epsilon, C_u + \epsilon]$, that has the same transitional state $y$
and for all states $v \ne y$, $e^{*(1)}_v = e^*_v$ and $p^{*(1)}_v = p^*_v$.

Proof: This result follows directly from linear programming theory. Given that we have a transitional
state, we have a non-degenerate solution to the pruned program $\widehat{LP}_u(\cdot)$ from the proof of
Claim D.2 (with exactly $3\hat{\Sigma}_u$ constraints at equality). Thus, small changes in the budget will
result in a solution that has the same set of binding constraints. These results carry over to $LP_u(\cdot)$.
(We refer the reader to [20] or another suitable optimization text for more discussion of linear programming
theory and sensitivity analysis.)

With respect to our problem, this implies that for every non-transitional state $v$: if $x^*_v = 0$ then
$x^{*(1)}_v = 0$; if $z^*_v = 0$ then $z^{*(1)}_v = 0$; if $x^*_v = w^*_v$ then $x^{*(1)}_v = w^{*(1)}_v$; and
if $z^*_v = w^*_v$ then $z^{*(1)}_v = w^{*(1)}_v$. Or equivalently,

    $p^{*(1)}_v = p^*_v, \quad e^{*(1)}_v = e^*_v$

■ 

This leads naturally to the following result: 

Claim D.4 The profit curve is piecewise linear. Further, each "corner" solution or point connecting two 
segments of the curve can be achieved by a deterministic policy. 

Proof: Again by linear programming theory, beginning at a non-degenerate solution to $LP_u(\cdot)$ and
making incremental changes to the budget will change the optimal objective value at a constant rate (the
shadow price of the budget constraint) until a new constraint becomes binding. With respect to $LP_u(C_u)$
this implies that if the optimal solution to this LP has a transitional state, then there exist some
$C_u^{(1)} < C_u$ and $C_u^{(2)} > C_u$ such that the segment of the profit curve from
$\mathcal{P}_u(C_u^{(1)})$ to $\mathcal{P}_u(C_u^{(2)})$ is linear.

Furthermore, since we know that each end of this line segment must have an additional binding constraint,
the optimal policy associated with these points on the curve must be of type 3 above (i.e. a deterministic
policy).

■ 

This result, combined with the fact that profit curves are concave, implies that for any state $u$, the ratio
index can always be calculated with infinitesimal cost, i.e. $r(u, h) = \mathcal{P}'_{+u}(0)$, where
$\mathcal{P}'_{+u}(C_u)$ denotes the right-sided derivative of the profit curve evaluated at $C_u$. Further, the
policy at the end of the first line segment of $\mathcal{P}_u(\cdot)$ is a deterministic ratio index policy.
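A small numerical illustration of this observation, as a sketch only: it assumes the profit curve is supplied as a list of corner points `(cost, profit)` with increasing positive cost, starting implicitly from the origin. That representation and the function names are hypothetical and not taken from the paper. For a concave piecewise-linear curve through the origin, the best profit-to-cost ratio over all corners coincides with the slope of the first segment, i.e. the right-sided derivative at zero.

```python
from fractions import Fraction


def right_derivative_at_zero(corners):
    """Slope of the first segment of a piecewise-linear curve through (0, 0).

    `corners` is a list of (cost, profit) pairs with increasing cost > 0
    (an assumed representation of the profit curve).
    """
    cost, profit = corners[0]
    return Fraction(profit) / Fraction(cost)


def best_ratio(corners):
    """Largest profit-to-cost ratio over all corner points."""
    return max(Fraction(profit) / Fraction(cost) for cost, profit in corners)


# A concave example curve: segment slopes 5/2, 1, 1/4.
example = [
    (Fraction(1, 5), Fraction(1, 2)),
    (Fraction(3, 5), Fraction(9, 10)),
    (Fraction(1), Fraction(1)),
]
```

On `example` both functions return 5/2, which is the sense in which the ratio index can be read off at infinitesimal cost.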



Before we prove Lemma 5.2 we will need to establish the following four properties. In the following,
recall that $\mathcal{P}'_{+u}(C_u)$ represents the right-sided derivative of $\mathcal{P}_u(\cdot)$ evaluated at
$C_u$. Correspondingly, $\mathcal{P}'_{-u}(C_u)$ represents the left-sided derivative of $\mathcal{P}_u(\cdot)$
evaluated at $C_u$.

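Claim D.5 There exists an optimal solution $(e^*, p^*)$ to $LP_u(C)$, such that for any state $v$ where
$w^*_v > 0$:

1. If $p^*_v > 0$, $\mathcal{P}'_{-v}(1) \ge \mathcal{P}'_{-u}(C)$

2. If $e^*_v > 0$, $\mathcal{P}'_{+u}(C) \ge \mathcal{P}'_{-v}(1)$

3. If $e^*_v > 0$, $r(v, h) \ge \mathcal{P}'_{-u}(C)$

4. If $1 - e^*_v - p^*_v > 0$, $\mathcal{P}'_{+u}(C) \ge r(v, h)$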

Proof: First, consider the set of all optimal solutions to $LP_u(C)$. Arbitrarily fix one optimal
solution $(\bar{e}, \bar{p})$. Use $S$ to denote the set of all optimal solutions $(e, p)$ to $LP_u(C)$ such that
$w_v = \bar{w}_v$. Select $(e^*, p^*) \in S$ such that $p^*_v = \max_S p_v$ and
$p^*_v + e^*_v = \max_S (p_v + e_v)$. We now prove the four properties.

1. If $p^*_v > 0$, $\mathcal{P}'_{-v}(1) \ge \mathcal{P}'_{-u}(C)$: Given $v$ such that $w^*_v > 0$ and
$p^*_v > 0$, assume $\mathcal{P}'_{-v}(1) < \mathcal{P}'_{-u}(C)$. Denote by $E^*_v$ the amount of budget devoted
to $v$ and its descendants. Thus, we know that the total profit garnered from $v$ and its descendants is
$\mathcal{P}_v(E^*_v)$. Let $(e, p)_v$ be the policy that induces $\mathcal{P}_v(E^*_v - \epsilon)$ on $T_v^h$.
Let $(e, p)$ be a policy that follows $(e, p)_v$ in the subtree of $v$ and follows $(e^*, p^*)$ otherwise. The
policy $(e, p)$ is a feasible solution to the problem of finding the point on the profit curve of $u$ with cost
$C - w^*_v \epsilon$. This policy has profit $\mathcal{P}_u(C) - w^*_v \epsilon \mathcal{P}'_{-v}(E^*_v)$.
Further, as $p^*_v > 0$, we know that either $p^*_v = 1$ (in which case $E^*_v = 1$) or else $v$ is the
transitional state for $LP_v(E^*_v)$. In the latter case, one end of the line segment of $\mathcal{P}_v(\cdot)$
which contains $\mathcal{P}_v(E^*_v)$ must have $p^*_v = 1$. This obviously corresponds to $\mathcal{P}_v(1)$.
Thus, $\mathcal{P}_v(E^*_v)$ must be on the last segment of the profit curve of $v$, so
$\mathcal{P}'_{-v}(E^*_v) = \mathcal{P}'_{-v}(1)$. Thus,

    $\mathcal{P}_u(C - w^*_v \epsilon) \ge \mathcal{P}_u(C) - w^*_v \epsilon \mathcal{P}'_{-v}(1)$
    $\qquad > \mathcal{P}_u(C) - w^*_v \epsilon \mathcal{P}'_{-u}(C)$

Thus, $\mathcal{P}_u(C) - \mathcal{P}_u(C - w^*_v \epsilon) < w^*_v \epsilon \mathcal{P}'_{-u}(C)$. By concavity,
this cannot be true, so it must be true that $\mathcal{P}'_{-v}(1) \ge \mathcal{P}'_{-u}(C)$.

2. If $e^*_v > 0$, $\mathcal{P}'_{+u}(C) \ge \mathcal{P}'_{-v}(1)$: Given $v$ such that $w^*_v > 0$ and
$e^*_v > 0$, assume $\mathcal{P}'_{-v}(1) > \mathcal{P}'_{+u}(C)$. Denote by $E^*_v$ the amount of budget devoted
to $v$ and its descendants. Observe that it must be the case that $E^*_v < 1$. Otherwise there exists some
other optimal solution $(e^{*(1)}, p^{*(1)})$ to $LP_u(C)$ with $p^{*(1)}_v = 1$, $e^{*(1)}_v = 0$. Thus, we know
that the total profit garnered from $v$ and its descendants is $\mathcal{P}_v(E^*_v)$. Let $(e, p)_v$ be the
policy that induces $\mathcal{P}_v(E^*_v + \epsilon)$ on $T_v^h$. Let $(e, p)$ be a policy that follows
$(e, p)_v$ in the subtree of $v$ and follows $(e^*, p^*)$ otherwise. The policy $(e, p)$ is a feasible solution
to the problem of finding the point on the profit curve of $u$ with cost $C + w^*_v \epsilon$. This policy has
profit $\mathcal{P}_u(C) + w^*_v \epsilon \mathcal{P}'_{+v}(E^*_v)$. Thus,

    $\mathcal{P}_u(C + w^*_v \epsilon) \ge \mathcal{P}_u(C) + w^*_v \epsilon \mathcal{P}'_{+v}(E^*_v)$

Further, as $E^*_v < 1$, $\mathcal{P}'_{+v}(E^*_v) \ge \mathcal{P}'_{-v}(1)$. Thus,

    $\mathcal{P}_u(C + w^*_v \epsilon) \ge \mathcal{P}_u(C) + w^*_v \epsilon \mathcal{P}'_{-v}(1)$
    $\Rightarrow \mathcal{P}_u(C + w^*_v \epsilon) - \mathcal{P}_u(C) > w^*_v \epsilon \mathcal{P}'_{+u}(C)$

By concavity, this cannot be true, so it must be true that $\mathcal{P}'_{+u}(C) \ge \mathcal{P}'_{-v}(1)$.

3. If $e^*_v > 0$, $r(v, h) \ge \mathcal{P}'_{-u}(C)$: Given $v$ such that $w^*_v > 0$ and $e^*_v > 0$, assume
$r(v, h) < \mathcal{P}'_{-u}(C)$. Denote by $E^*_v$ the amount of budget devoted to $v$ and its descendants.
Thus, we know that the total profit garnered from $v$ and its descendants is $\mathcal{P}_v(E^*_v)$. Let
$(e, p)_v$ be the policy that induces $\mathcal{P}_v(E^*_v - \epsilon)$ on $T_v^h$. Let $(e, p)$ be a policy that
follows $(e, p)_v$ in the subtree of $v$ and follows $(e^*, p^*)$ otherwise. The policy $(e, p)$ is a feasible
solution to the problem of finding the point on the profit curve of $u$ with cost $C - w^*_v \epsilon$. This
policy has profit $\mathcal{P}_u(C) - w^*_v \epsilon \mathcal{P}'_{-v}(E^*_v)$. Further, as $E^*_v > 0$ (since
$e^*_v > 0$), we know that $\mathcal{P}'_{-v}(E^*_v) \le r(v, h)$. Thus,

    $\mathcal{P}_u(C - w^*_v \epsilon) \ge \mathcal{P}_u(C) - w^*_v \epsilon r(v, h)$
    $\qquad > \mathcal{P}_u(C) - w^*_v \epsilon \mathcal{P}'_{-u}(C)$

Thus, $\mathcal{P}_u(C) - \mathcal{P}_u(C - w^*_v \epsilon) < w^*_v \epsilon \mathcal{P}'_{-u}(C)$. By concavity,
this cannot be true, so it must be true that $r(v, h) \ge \mathcal{P}'_{-u}(C)$.

4. If $1 - e^*_v - p^*_v > 0$, $\mathcal{P}'_{+u}(C) \ge r(v, h)$: Given $v$ such that $w^*_v > 0$ and
$1 - e^*_v - p^*_v > 0$, assume $r(v, h) > \mathcal{P}'_{+u}(C)$. Denote by $E^*_v$ the amount of budget devoted
to $v$ and its descendants. Thus, we know that the total profit garnered from $v$ and its descendants is
$\mathcal{P}_v(E^*_v)$. Let $(e, p)_v$ be the policy that induces $\mathcal{P}_v(E^*_v + \epsilon)$ on $T_v^h$.
Let $(e, p)$ be a policy that follows $(e, p)_v$ in the subtree of $v$ and follows $(e^*, p^*)$ otherwise. The
policy $(e, p)$ is a feasible solution to the problem of finding the point on the profit curve of $u$ with cost
$C + w^*_v \epsilon$. This policy has profit $\mathcal{P}_u(C) + w^*_v \epsilon \mathcal{P}'_{+v}(E^*_v)$. Thus,

    $\mathcal{P}_u(C + w^*_v \epsilon) \ge \mathcal{P}_u(C) + w^*_v \epsilon \mathcal{P}'_{+v}(E^*_v)$

Further, as $1 - e^*_v - p^*_v > 0$, $\mathcal{P}'_{+v}(E^*_v) = r(v, h)$. (Since neither $e^*_v = 1$ nor
$p^*_v = 1$, at least one must be zero and the other is either zero or the transitional state. Thus, at one end
of this segment of $\mathcal{P}_v(\cdot)$ must be the abandonment policy, so we are on the first segment of the
profit curve of $v$.) Thus,

    $\mathcal{P}_u(C + w^*_v \epsilon) \ge \mathcal{P}_u(C) + w^*_v \epsilon r(v, h)$
    $\Rightarrow \mathcal{P}_u(C + w^*_v \epsilon) - \mathcal{P}_u(C) > w^*_v \epsilon \mathcal{P}'_{+u}(C)$

By concavity, this cannot be true, so it must be true that $\mathcal{P}'_{+u}(C) \ge r(v, h)$.



This result directly implies Lemma 2.3 which we now prove. 



Corollary D.6 (same as Lemma 2.3) A ratio index policy for arm-state $u$, $\pi^r(u, h)$, does not abandon any
arm-state $v$ with $r(v, h) > r(u, h)$ and does not explore or exploit any arm-state $v$ with
$r(v, h) < r(u, h)$.



Proof: With respect to Claim D.5, select $C$ such that $0 < C < C(\pi^r(u, h))$. Thus,
$\mathcal{P}'_{-u}(C) = \mathcal{P}'_{+u}(C) = r(u, h)$. Since $\mathcal{P}'_{-v}(1) \le r(v, h)$, the first, third
and fourth properties proved in Claim D.5 respectively imply that for the optimal randomized policy
corresponding to this point on the profit curve: (1) if $r(u, h) > r(v, h)$, then $p^*_v = 0$; (3) if
$r(u, h) > r(v, h)$, then $e^*_v = 0$; and (4) if $r(v, h) > r(u, h)$, then $1 - e^*_v - p^*_v = 0$. As $v$ cannot
be a transitional state under any of these conditions, then by the argument in Claim D.4 these properties
must hold for the ratio index policy as well.

■

We are now ready to prove the following monotonicity result, which follows from the concavity of the 
profit curve. 



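Claim D.7 (same as Lemma 5.2) For any state $u$ and any $C^{(1)}$ and $C^{(2)}$, where
$0 \le C^{(1)} < C^{(2)} \le 1$, there exist optimal solutions $(e^{(1)}, p^{(1)})$ and $(e^{(2)}, p^{(2)})$ to
$LP_u(C^{(1)})$ and $LP_u(C^{(2)})$ respectively such that $p^{(1)}_v \le p^{(2)}_v$ and
$e^{(1)}_v + p^{(1)}_v \le e^{(2)}_v + p^{(2)}_v$ for all $v$ in $T_u^h$.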

Proof: Given the four properties from Claim D.5 we can see that for any state $v$, as the slope of
$\mathcal{P}_u(C)$ decreases with increasing $C$, there will be up to three regions where we would first abandon
at $v$, then explore at $v$, then exploit. The only remaining technical issue is the case where the slope
of the profit curve of $u$ exactly equals $\mathcal{P}'_{-v}(1)$ or $r(v, h)$. In these cases, we are of necessity
on a line segment of $\mathcal{P}_u(\cdot)$ where $v$ is at some point the transitional state.³ In these cases,
at the two endpoints of the segment we must have one of the three following pairs of values for $v$:

1. $p_v = e_v = 0$; $p_v = 1$

2. $p_v = e_v = 0$; $e_v = 1$

3. $e_v = 1$; $p_v = 1$

For the first two of these cases, it is obvious that the assignment on the right corresponds to higher cost.
Thus, our monotonicity property holds. For the third case, we know by the martingale property that setting
$p_v = 1$ yields the maximum possible profit at the node. Thus, the monotonicity property must hold in this
case as well.

■ 

We index the $B_u$ "corner" solutions on $\mathcal{P}_u(C_u)$ as $s_1, \ldots, s_{B_u}$ corresponding to the
budget associated with their rightmost end point. The above claim further implies that if for every state
$v \in T_u^h$ and every "corner" solution $s_i$ on $\mathcal{P}_u(C_u)$ we apply a label
$L_{s_i}(v) \in \{A, E, P\}$ where $A$ corresponds to "abandoning" at $v$ ($p_v = e_v = 0$), $E$ corresponds to
"exploring" ($e_v = 1$), and $P$ corresponds to "exploiting" ($p_v = 1$), then for any $v$ and any $i, j$ such
that $i < j$, if $L_{s_i}(v) = P$, then $L_{s_j}(v) = P$, and if $L_{s_i}(v) = E$, then $L_{s_j}(v) \ne A$.
Thus, we can find a set of solutions such that as we increase the budget, once a state is labeled "P" it will
always remain a "P" and once it becomes an "E" it will never become an "A". Thus, each state can
change labels at most twice. As every successive "corner" solution must involve changing the label of at
least one state, the profit curve for a state can have at most $2\Sigma_u$ segments (where $\Sigma_u$
represents the number of states in $T_u^h$). This establishes the following claim, which completes the proof of
Theorem 5.1.



Claim D.8 The profit curve $\mathcal{P}_u(C_u)$ for any state $u$ can have at most $2\Sigma_u$ segments, where
$\Sigma_u$ is the number of states in $T_u^h$.
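The counting argument behind this bound can be illustrated with a short sketch. It assumes each state's labels across successive "corner" solutions are given as a sequence over `A`, `E`, `P`; the encoding and the function name are hypothetical, chosen only for illustration. Because labels may only move in the order A, then E, then P, a single state contributes at most two label changes.

```python
# Rank of each label in the only order in which labels may change.
ORDER = {"A": 0, "E": 1, "P": 2}


def label_changes(labels):
    """Count label changes of one state across corner solutions.

    `labels` lists the state's label at successive corner solutions,
    ordered by increasing budget. Raises ValueError if the sequence ever
    moves backwards (for example from E to A), which the monotonicity
    property rules out.
    """
    ranks = [ORDER[label] for label in labels]
    pairs = list(zip(ranks, ranks[1:]))
    if any(later < earlier for earlier, later in pairs):
        raise ValueError("labels must move monotonically A -> E -> P")
    return sum(earlier != later for earlier, later in pairs)
```

Summing this count over all states bounds the number of corner solutions at which some label changes, which is where the factor of two per state comes from.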

D.2 Algorithm for Computing the Profit Curve 

The algorithm for computing the profit curve (and hence the ratio index) involves recursively calculating
the profit curve for a state given the profit curves for all of its successor states. In this process, given the
profit curve for all subsequent states of a given state $u$, we begin by constructing an intermediate curve,
the exploration profit curve of $u$ ($X_u(\cdot)$). This curve denotes the optimal profit for any given cost
contingent upon exploring at $u$ (i.e. $e_u = 1$). Once we have found the exploration profit curve, we then take
the concave envelope over this curve combined with the abandonment policy and the exploitation policy to find
the profit curve for a state (see Figure 1 for an illustration).

Under these conditions, we can modify the recursive equation we introduced earlier to calculate the
profit curve for $u$ ($\mathcal{P}_u(\cdot)$) into a simpler equation to calculate the exploration profit curve

    $X_u(C_u) = \max_{E^u} \sum_{v \in D(u)} p_{uv} \mathcal{P}_v(E^u_v)$

such that:

    $1/h + \sum_{v \in D(u)} p_{uv} E^u_v \le C_u$

    $E^u \ge 0$

[Figure 1: The relation between the profit and the exploration-profit curves. Panel title: Profit Curve as
Concave Envelope (h=5).]

3 It is possible to have a line segment of the profit curve with multiple transitional states (e.g. in the case
of two states with identical ratio indices). In this case, we can always look at a sequence of basic feasible
solutions such that the transitions are ordered and there is at most one transitional state at any point on the
segment.

Recall that $E^u_v$ is the budget allocated to $v$ should we visit it immediately after $u$, and $E^u$ is the
vector of these budgets for each of the immediate descendants of $u$.

For each $v \in D(u)$, let $B_v$ represent the number of "corner" solutions on $\mathcal{P}_v(\cdot)$ and denote
the cost of the $i$th such "corner" solution as $s^v_i$ (where $s^v_0 = 0$ and $s^v_{B_v} = 1$). We can then
create the following modified recursive equation to calculate the exploration profit curve.



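    $X_u(C_u) = \max_{e^u} \sum_{v \in D(u)} p_{uv} \sum_{i=1}^{B_v} \left( \mathcal{P}_v(s^v_i) - \mathcal{P}_v(s^v_{i-1}) \right) e^u_{v,i}$

s.t.

    $1/h + \sum_{v \in D(u)} p_{uv} \sum_{i=1}^{B_v} \left( s^v_i - s^v_{i-1} \right) e^u_{v,i} \le C_u$

    $0 \le e^u_{v,i} \le 1 \quad \forall v \in D(u),\ i \in \{1, \ldots, B_v\}$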

This is a linear program where each decision variable, $e^u_{v,i}$, selects some fraction of the $i$th segment
of the profit curve of $v$; $e^u$ represents the collection of all such variables. As the profit curve is
concave (by Claim D.1),

    $\frac{\mathcal{P}_v(s^v_i) - \mathcal{P}_v(s^v_{i-1})}{s^v_i - s^v_{i-1}} \le \frac{\mathcal{P}_v(s^v_j) - \mathcal{P}_v(s^v_{j-1})}{s^v_j - s^v_{j-1}} \quad \forall i > j.$

Thus, there exists an optimal solution which only assigns $e^u_{v,i} > 0$ if $e^u_{v,i-1} = 1$ for any $i > 1$.

Further, through inspection we can see that the optimal solution for any $C_u$ is to select the segments
in order of decreasing slope (where the slope of the $i$th segment of $\mathcal{P}_v(\cdot)$ is
$\frac{\mathcal{P}_v(s^v_i) - \mathcal{P}_v(s^v_{i-1})}{s^v_i - s^v_{i-1}}$) until all budget is exhausted. By
ordering segments thus, we can easily construct $X_u(\cdot)$. The algorithm below orders the segments of the
elements of $D(u)$ and builds $X_u(\cdot)$, storing the costs and the budgets allocated to all descendants
($\mathcal{E}_v = \sum_i (s^v_i - s^v_{i-1}) e^u_{v,i}$) for each "corner" solution of $X_u(\cdot)$.

Algorithm: ComputeExplorationProfitCurve

1) For all $v \in D(u)$, $i \in \{1, \ldots, B_v\}$

    /* Compute profit, cost, and slope for each line segment of $\mathcal{P}_v(\cdot)$ */

    Set $x^v_i = \mathcal{P}_v(s^v_i) - \mathcal{P}_v(s^v_{i-1})$
    Set $c^v_i = s^v_i - s^v_{i-1}$
    Set $M^v_i = x^v_i / c^v_i$

2) Sort the $\sum_{v \in D(u)} B_v$ elements of the form $M^v_i$ from largest to smallest

    Let $d(j)$ index the node for the $j$th largest element in the list
    Let $t(j)$ indicate which segment of $\mathcal{P}_{d(j)}$ this is

    /* $M^{d(j)}_{t(j)}$ = slope of the $j$th line segment in the ordered list. */

3) Set $S_0 = 1/h$ /* fixed cost */, $X_0 = 0$ /* initial profit */, $\mathcal{E}_{v,0} = 0\ \forall v \in D(u)$
    /* initial budgets for descendants */

4) For $i = 1$ to $\sum_{v \in D(u)} B_v$

    /* Add next segment to the curve */

    /* Compute current total profit ($X_i$) and cost ($S_i$) */
    Let $X_i = X_{i-1} + p_{u,d(i)} x^{d(i)}_{t(i)}$
    Let $S_i = S_{i-1} + p_{u,d(i)} c^{d(i)}_{t(i)}$

    /* Compute budgets allocated to descendants ($\mathcal{E}_{v,i}$) at the end */
    /* of each segment (needed to represent the policy) */
    Let $\mathcal{E}_{d(i),i} = \mathcal{E}_{d(i),i-1} + c^{d(i)}_{t(i)}$
    Let $\mathcal{E}_{v,i} = \mathcal{E}_{v,i-1}\ \forall v \ne d(i)$

5) Set $n = 1$, $k_1 = 1$

6) For $i = 2$ to $\sum_{v \in D(u)} B_v$

    /* Find changes in slope on the exploration profit curve */
    If $\frac{X_i - X_{i-1}}{S_i - S_{i-1}} < \frac{X_{i-1} - X_{i-2}}{S_{i-1} - S_{i-2}}$
        $n$++
    $k_n = i$

7) For $m = 1$ to $n$

    /* Merge together segments of the exploration profit */
    /* curve with the same slope */
    $S_m = S_{k_m}$, $X_u(S_m) = X_{k_m}$, $\mathcal{E}_{v,m} = \mathcal{E}_{v,k_m}$

8) $B^X_u = n$

Algorithm ComputeExplorationProfitCurve represents the exploration profit curve for state $u$
by returning the number of line segments of the curve (not including the zero slope segment from cost of 0 to
$S_0 = 1/h$), $B^X_u$, as well as the cost ($S_i$), profit ($X_u(S_i)$), and vector of budgets to allocate to all
immediate descendants ($\mathcal{E}_{v,i}$) corresponding to the endpoint of each segment.

Given that the profit curves of all descendants are concave, the sorting of line segments in step 2) equates
to simply interleaving the segments of the different states and can be performed in
$O(d\Sigma_u \log \Sigma_u)$ time using a simple min-heap, where $d$ is the maximum number of immediate
descendants for a node. After sorting these segments, steps 3) and 4) then determine the cost and profit
associated with adding each segment to the exploration profit curve. Finally, as there may and likely will be
duplicates in the sorted list of slopes, steps 5) through 7) merge all segments of the curve with the same slope.
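The interleave-and-merge procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each child's profit curve is assumed to be given as a list of corner points starting at `(0, 0)`, the function name and argument layout are hypothetical, and the per-descendant budget bookkeeping is omitted for brevity. It relies on `heapq.merge` to interleave the per-child segment lists, which are already sorted by decreasing slope because each child curve is concave.

```python
import heapq
from fractions import Fraction


def exploration_profit_curve(children, h):
    """Corner points of the exploration profit curve X_u.

    `children` is a list of (p_uv, corners) pairs, where `corners` is the
    child's concave profit curve as [(0, 0), (s_1, P(s_1)), ...].
    Returns [(S_0, X_0), (S_1, X_1), ...] with S_0 = 1/h and X_0 = 0,
    where consecutive segments of equal slope have been merged.
    """

    def segments(index, p_uv, corners):
        # One tuple per segment: (-slope, child index, weighted cost, weighted profit).
        # Negated slopes are non-decreasing along a concave curve.
        for (s0, v0), (s1, v1) in zip(corners, corners[1:]):
            dc = Fraction(s1) - Fraction(s0)
            dv = Fraction(v1) - Fraction(v0)
            yield (-dv / dc, index, p_uv * dc, p_uv * dv)

    streams = [segments(i, Fraction(p), c) for i, (p, c) in enumerate(children)]
    cost, profit = Fraction(1, h), Fraction(0)
    points = [(cost, profit)]
    last_slope = None
    for neg_slope, _, dc, dv in heapq.merge(*streams):
        cost, profit = cost + dc, profit + dv
        if last_slope == -neg_slope:
            points[-1] = (cost, profit)  # same slope: extend the previous segment
        else:
            points.append((cost, profit))
            last_slope = -neg_slope
    return points
```

For instance, with `h = 5` and two equally likely children whose curves are `[(0, 0), (1, 1)]` and `[(0, 0), (1/2, 1), (1, 3/2)]`, the sketch yields the corners `(1/5, 0)`, `(9/20, 1/2)` and `(6/5, 5/4)`: the two slope-1 segments from different children are merged into one.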

Given the exploration profit curve, it is much easier to calculate the profit curve for a state. From
Lemma D.1 we know that $\mathcal{P}_u(1) = \zeta(u)$. Further, this must be the only "corner" solution
corresponding to exploiting at $u$ ($p_u = 1$). All other "corner" solutions must thus correspond to exploring at
$u$ ($e_u = 1$) and thus must correspond to points on $X_u(\cdot)$. As we know $\mathcal{P}_u(\cdot)$ is concave,
we can simply take the concave envelope of the points $(0, \mathcal{P}_u(0) = 0)$,
$(S_i, X_u(S_i))\ \forall i \in \{1, \ldots, B^X_u\}$, and $(1, \mathcal{P}_u(1) = \zeta(u))$. The
algorithm below does this.

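Algorithm: ComputeProfitCurve

1) Find $j^* = \arg\max_{j \in \{1, \ldots, B^X_u\}} R_j$ where $R_j = X_u(S_j) / S_j$

2a) If $R_{j^*} \le \zeta(u)$ /* The ratio index policy exploits immediately */

    Set $r(u) = \zeta(u)$, $s^u_1 = 1$, $\mathcal{P}_u(s^u_1) = \zeta(u)$, $E^u_v(1) = 0\ \forall v \in D(u)$, $B_u = 1$

2b) Else /* The ratio index policy explores */

    Set $r(u) = R_{j^*}$, $s^u_1 = S_{j^*}$, $\mathcal{P}_u(s^u_1) = X_u(S_{j^*})$, $E^u(1) = \mathcal{E}_{j^*}$

    Set $i = 1$, $j = j^* + 1$.

    While $\frac{X_u(S_j) - X_u(S_{j-1})}{S_j - S_{j-1}} > \frac{\zeta(u) - X_u(S_{j-1})}{1 - S_{j-1}}$

        /* Greater marginal return to explore than exploit */
        $s^u_{i+1} = S_j$
        $\mathcal{P}_u(s^u_{i+1}) = X_u(S_j)$
        $E^u(i + 1) = \mathcal{E}_j$
        $i = i + 1$, $j = j + 1$

    Set $s^u_{i+1} = 1$, $\mathcal{P}_u(s^u_{i+1}) = \zeta(u)$, $B_u = i + 1$.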

Algorithm ComputeProfitCurve runs in $O(d\Sigma_u)$ time. Step 1) computes the value $R_j$ for each
segment of $X_u(\cdot)$, which represents the slope of the line segment from the origin to the point
$(S_j, X_u(S_j))$. In the event that $R_{j^*} \le \zeta(u)$, the ratio index policy is to exploit immediately
and we are done determining the profit curve for $u$ (step 2a). Otherwise, once we have found the ratio index
policy, we continue to look at higher budget exploration policies to determine subsequent segments of the profit
curve (step 2b). The quantity $E^u(i)$ represents the budgets allocated to each of the immediate descendants at
the end of the $i$th segment of the profit curve. These values are only required to represent the actual policy,
not to calculate the ratio index or profit curve of any state. The quantity
$\frac{\zeta(u) - X_u(S_{j-1})}{1 - S_{j-1}}$ is the marginal ratio of transitioning from $s^u_i$ to
exploitation at $u$. Once the slopes of the segments of $X_u(\cdot)$ are no larger than this, it is optimal to
transition to exploitation at $u$.
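The concave-envelope step can also be sketched directly. The version below is not a transcription of ComputeProfitCurve: it swaps in a standard monotone-chain upper hull over the candidate points. It assumes the exploration corners are sorted by cost, and it only considers exploration corners with cost strictly between 0 and 1, which is an assumption of this sketch. The function names and the `zeta` parameter (standing for the exploitation payoff at $u$) are hypothetical.

```python
from fractions import Fraction


def cross(o, a, b):
    """Cross product of vectors o->a and o->b; negative means a clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def profit_curve(exploration_corners, zeta):
    """Upper concave envelope of (0, 0), the exploration corners, and (1, zeta).

    `exploration_corners` is a cost-sorted list of (cost, profit) points on
    the exploration profit curve. Returns the corner points of the envelope.
    """
    points = [(Fraction(0), Fraction(0))]
    points += [(Fraction(c), Fraction(x)) for c, x in exploration_corners if 0 < c < 1]
    points.append((Fraction(1), Fraction(zeta)))
    hull = []
    for point in points:
        # Drop the last kept point while it lies on or below the chord
        # from the point before it to the new point.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], point) >= 0:
            hull.pop()
        hull.append(point)
    return hull


def ratio_index(hull):
    """Slope of the first envelope segment, i.e. the right derivative at 0."""
    cost, profit = hull[1]
    return profit / cost
```

With exploration corners `(1/5, 0)`, `(2/5, 1/2)`, `(6/5, 3/5)` and `zeta = 3/5`, the envelope is `(0, 0)`, `(2/5, 1/2)`, `(1, 3/5)` and the ratio index is 5/4. If no exploration corner survives, the envelope is the single segment from the origin to `(1, zeta)`, matching the exploit-immediately case.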

As each step above takes at most $O(d\Sigma_u \log \Sigma_u)$ time, the total time to compute the profit curves
for all states in the layered DAG is $O(d\Sigma^2 \log \Sigma)$.